Use Memento API for archived internet resources
Closed, ResolvedPublic

Description

weblib.py provides a way to obtain Wayback Machine and WebCite archive URLs for Internet resources. It uses a simple custom API for each service. These weblib methods are used only by scripts/weblinkchecker.py, which prefers Wayback over WebCite.

Wayback has supported the Memento protocol for many years now. Using Memento to access archived versions of Internet resources allows multiple implementations to share the same protocol, and an Internet resource can also provide its own Memento support (MediaWiki has an extension that supports this).

http://ws-dl.blogspot.fr/2013/07/2013-07-15-wayback-machine-upgrades.html
http://mementoweb.org/

Our WebCite code also uses a custom API. I could not find a Memento API for WebCite.

weblib.py should provide plumbing for obtaining an archived version of an Internet resource. It should attempt to use the original resource or Memento providers like Wayback, and any non-Memento providers (WebCite?). It should have a default algorithm for picking the 'best' archived version, but allow callers to override this with a specific preference.
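The selection behaviour described above could be sketched roughly as follows. This is only an illustration: `pick_archive`, the `(name, lookup)` provider pairs, and the `prefer` argument are hypothetical names, not part of weblib.py, and the "default algorithm" here is simply "first provider in the list that has a copy".

```python
def pick_archive(url, providers, prefer=None):
    """Return an archived URL for ``url`` (hypothetical helper).

    ``providers`` is an ordered list of (name, lookup) pairs, where each
    lookup takes a URL and returns an archive URL or None.
    ``prefer`` optionally names the provider the caller wants to win.
    """
    found = []
    for name, lookup in providers:
        archived = lookup(url)
        if archived:
            found.append((name, archived))

    if not found:
        raise LookupError('no archive found for %s' % url)

    # Caller override: use the preferred provider if it has a copy.
    if prefer is not None:
        for name, archived in found:
            if name == prefer:
                return archived

    # Default: the first provider in the list that returned a copy.
    return found[0][1]


# Stand-in lookups for illustration only; real ones would query each service.
providers = [
    ('wayback', lambda url: 'http://web.archive.org/web/' + url),
    ('webcite', lambda url: None),
]
```

Keeping the provider list ordered means the existing "Wayback before WebCite" preference falls out of the default, while `prefer` covers callers with a specific requirement.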

Event Timeline

jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description.
jayvdb added a project: Pywikibot.
jayvdb changed Security from none to None.
jayvdb added a subscriber: jayvdb.

I couldn't quickly find a Memento Python library, but there is some link parsing code at https://bitbucket.org/azaroth42/linkheaderparser/src/c2321bf3349b94a12a37ed8c41d4e4785006ada7

As seen in T104761, the weblib API is problematic as it returns None for failure, rather than using exceptions to indicate failure.
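As a rough illustration of the alternative, a None-returning lookup can be wrapped so that failure surfaces as an exception. `ArchiveLookupError` and `raise_on_none` are hypothetical names invented for this sketch; they are not part of weblib or any existing package.

```python
import functools


class ArchiveLookupError(Exception):
    """Hypothetical error raised when no archived copy is available."""


def raise_on_none(func):
    """Wrap a lookup that returns None on failure so that it raises instead."""
    @functools.wraps(func)
    def wrapper(url, *args, **kwargs):
        result = func(url, *args, **kwargs)
        if result is None:
            raise ArchiveLookupError('no archived copy found for %s' % url)
        return result
    return wrapper
```

With this shape, callers no longer need to remember to check for None; a missed check becomes a visible traceback rather than a silent None passed further along.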

A new package has arrived on the scene which may help.

https://github.com/pastpages/django-memento-framework
https://pypi.python.org/pypi/django-memento-framework

Perhaps we can work with the creators to create a basic Memento Python library. I'll shoot palewire/Ben Welsh an email.

Searching github, I see also:

https://github.com/mementoweb/pymemento
https://github.com/samwyse/PyMemento

Neither is on PyPI. The first one looks mature enough to be published there, but the second one is more 'useful'.

I've emailed Ben Welsh and raised quite a few issues and pull requests against https://github.com/mementoweb/pymemento which, when resolved, would make it a reasonable library for raw access to Memento data. Probably the biggest problem is that requests doesn't parse the HTTP Link header correctly: https://github.com/kennethreitz/requests/issues/2707
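One thing that makes Memento Link headers awkward to parse is that the datetime attribute is an HTTP date such as "Sat, 11 Jul 2015 08:51:13 GMT", so its quoted value contains a comma and a plain split on commas cuts links in half. The sketch below is a minimal standard-library parser that respects quoted values; it is not the code from requests or pymemento, `parse_link_header` is a name made up here, and it ignores corner cases such as escaped quotes.

```python
import re

# One link: <uri> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^",;]+))*)')
# One parameter: ;name="quoted value" or ;name=token
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^",;]+))')


def parse_link_header(value):
    """Return a list of (uri, params) tuples from a Link header value."""
    links = []
    for match in _LINK_RE.finditer(value):
        uri, raw_params = match.group(1), match.group(2)
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(raw_params):
            # Exactly one of quoted/bare is non-empty for a given parameter.
            params[name.lower()] = quoted if quoted else bare.strip()
        links.append((uri, params))
    return links


def find_rel(links, rel):
    """Return the first (uri, params) whose rel contains ``rel`` as a word."""
    for uri, params in links:
        if rel in params.get('rel', '').split():
            return uri, params
    return None
```

Because rel values can hold several space-separated relation types (for example "last memento"), `find_rel` matches on individual words rather than on the whole string.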

https://github.com/samwyse/PyMemento has the basis of an archive depot (http://mementoweb.org/depot/), which could be very useful if we want an archived link for a website that doesn't provide its own Memento Link headers. However, it doesn't include WebCite, as WebCite doesn't support Memento. :/

Note that the API at http://archive.org/wayback/available?url=example.com currently returns 503; however, their Memento API is working OK.

$ wget -O 'archive.links' 'http://web.archive.org/web/timemap/link/http://www.w3.org/TR/webarch/'
--2015-08-07 01:20:05--  http://web.archive.org/web/timemap/link/http://www.w3.org/TR/webarch/
Resolving web.archive.org (web.archive.org)... 207.241.224.26
Connecting to web.archive.org (web.archive.org)|207.241.224.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/link-format]
Saving to: ‘archive.links’

archive.links                                      [  <=>                                                                                               ]  89.15K   176KB/s   in 0.5s   

2015-08-07 01:20:06 (176 KB/s) - ‘archive.links’ saved [91289]

$ wget -S 'http://web.archive.org/web/http://www.w3.org/TR/webarch/'
--2015-08-07 01:21:24--  http://web.archive.org/web/http://www.w3.org/TR/webarch/
Resolving web.archive.org (web.archive.org)... 207.241.224.26
Connecting to web.archive.org (web.archive.org)|207.241.224.26|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 302 Moved Temporarily
  Server: Tengine/2.1.0
  Date: Thu, 06 Aug 2015 15:21:25 GMT
  Content-Type: text/html
  Transfer-Encoding: chunked
  Connection: keep-alive
  set-cookie: wayback_server=3; Domain=archive.org; Path=/; Expires=Sat, 05-Sep-15 15:21:25 GMT;
  Location: /web/20150711085113/http://www.w3.org/TR/webarch/
  Vary: accept-datetime
  Link: <http://www.w3.org/TR/webarch/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.w3.org/TR/webarch/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/20150711085113/http://www.w3.org/TR/webarch/>; rel="last memento"; datetime="Sat, 11 Jul 2015 08:51:13 GMT", <http://web.archive.org/web/20020911073933/http://www.w3.org/TR/webarch/>; rel="first memento"; datetime="Wed, 11 Sep 2002 07:39:33 GMT", <http://web.archive.org/web/20150630093225/http://www.w3.org/TR/webarch/>; rel="prev memento"; datetime="Tue, 30 Jun 2015 09:32:25 GMT"
  X-Link-JSON: {"closest":{"wb_url":"http://web.archive.org/web/20150711085113/http://www.w3.org/TR/webarch/","timestamp":"20150711085113","status":"200"}}
  X-Archive-Wayback-Perf: {"IndexLoad":49,"IndexQueryTotal":49,"RobotsFetchTotal":2,"RobotsRedis":2,"RobotsTotal":2,"Total":52}
  X-Archive-Playback: 0
  X-Page-Cache: HIT
Location: /web/20150711085113/http://www.w3.org/TR/webarch/ [following]
--2015-08-07 01:21:25--  http://web.archive.org/web/20150711085113/http://www.w3.org/TR/webarch/
Reusing existing connection to web.archive.org:80.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Server: Tengine/2.1.0
  Date: Thu, 06 Aug 2015 15:21:25 GMT
  Content-Type: text/html;charset=utf-8
  Content-Length: 207230
  Connection: keep-alive
  Memento-Datetime: Sat, 11 Jul 2015 08:51:13 GMT
  Link: <http://www.w3.org/TR/webarch/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.w3.org/TR/webarch/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.w3.org/TR/webarch/>; rel="timegate", <http://web.archive.org/web/20020911073933/http://www.w3.org/TR/webarch/>; rel="first memento"; datetime="Wed, 11 Sep 2002 07:39:33 GMT", <http://web.archive.org/web/20150703051811/http://www.w3.org/TR/webarch/>; rel="prev memento"; datetime="Fri, 03 Jul 2015 05:18:11 GMT", <http://web.archive.org/web/20150711085113/http://www.w3.org/TR/webarch/>; rel="memento"; datetime="Sat, 11 Jul 2015 08:51:13 GMT", <http://web.archive.org/web/20150713010337/http://www.w3.org/TR/webarch/>; rel="next memento"; datetime="Mon, 13 Jul 2015 01:03:37 GMT", <http://web.archive.org/web/20150802013147/http://www.w3.org/TR/webarch/>; rel="last memento"; datetime="Sun, 02 Aug 2015 01:31:47 GMT"
  X-Archive-Orig-etag: "2fabd-3eb3922c2ffc0"
  X-Archive-Guessed-Charset: utf-8
  X-Archive-Orig-cache-control: max-age=31536000
  X-Archive-Orig-content-type: text/html; charset=utf-8
  X-Archive-Orig-server: Apache/2
  X-Archive-Orig-access-control-allow-origin: *
  X-Archive-Orig-last-modified: Tue, 14 Dec 2004 20:19:19 GMT
  X-Archive-Orig-expires: Sun, 10 Jul 2016 08:51:13 GMT
  X-Archive-Orig-accept-ranges: bytes
  X-Archive-Orig-connection: close
  X-Archive-Orig-date: Sat, 11 Jul 2015 08:51:13 GMT
  X-Archive-Orig-content-length: 195261
  X-Archive-Orig-p3p: policyref="http://www.w3.org/2014/08/p3p.xml"
  X-Archive-Wayback-Perf: {"IndexLoad":52,"IndexQueryTotal":52,"RobotsFetchTotal":2,"RobotsRedis":2,"RobotsTotal":3,"Total":97,"WArcResource":35}
  X-Archive-Playback: 1
  X-Page-Cache: HIT
Length: 207230 (202K) [text/html]
Saving to: ‘index.html’

index.html                                   100%[===================================================================================================>] 202.37K   226KB/s   in 0.9s   

2015-08-07 01:21:26 (226 KB/s) - ‘index.html’ saved [207230/207230]
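The responses above advertise http://web.archive.org/web/ followed by the original URL as the TimeGate, and the redirect carries "Vary: accept-datetime". Under the Memento protocol, a client asks a TimeGate for a particular point in time by sending an Accept-Datetime header. The following is a small sketch of building such a request and of reading back a Memento-Datetime value. The function names are invented for illustration, and nothing is sent over the network here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.request import Request

# TimeGate prefix as advertised in the rel="timegate" link above.
WAYBACK_TIMEGATE = 'http://web.archive.org/web/'


def build_timegate_request(url, when):
    """Build (but do not send) a TimeGate request for ``url`` at ``when``.

    ``when`` must be a timezone-aware datetime.
    """
    if when.tzinfo is None:
        raise ValueError('when must be timezone-aware')
    when_utc = when.astimezone(timezone.utc)
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    stamp = format_datetime(when_utc, usegmt=True)
    return Request(WAYBACK_TIMEGATE + url,
                   headers={'Accept-Datetime': stamp})


def parse_memento_datetime(value):
    """Convert a Memento-Datetime header value to an aware datetime."""
    return parsedate_to_datetime(value)


request = build_timegate_request(
    'http://www.w3.org/TR/webarch/',
    datetime(2015, 7, 11, 8, 51, 13, tzinfo=timezone.utc))
```

Passing `request` to `urllib.request.urlopen` would perform the actual negotiation; the archive is then expected to redirect to the memento closest to the requested time.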

Change 230781 had a related patch set uploaded (by John Vandenberg):
Memento support for weblinkchecker

https://gerrit.wikimedia.org/r/230781

Change 230781 merged by jenkins-bot:
Memento support for weblinkchecker

https://gerrit.wikimedia.org/r/230781

jayvdb claimed this task.

Change 181241 had a related patch set uploaded (by John Vandenberg):
Allow datetime to specify the desired archive datestamp

https://gerrit.wikimedia.org/r/181241

Change 181241 abandoned by John Vandenberg:
Allow datetime to specify the desired archive datestamp

Reason:
Both weblib functions are currently broken by server-side problems. The IA breakage might be temporary, but weblib has effectively been replaced by an external package. If IA starts working again, and someone needs weblib without memento_client, this can be unabandoned and fixed up.

https://gerrit.wikimedia.org/r/181241

Change 232703 had a related patch set uploaded (by John Vandenberg):
Use published memento_client v0.5.1

https://gerrit.wikimedia.org/r/232703

Change 232703 merged by jenkins-bot:
Use published memento_client v0.5.1

https://gerrit.wikimedia.org/r/232703

Change 511388 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [tests] Announce FutureWarning with weblib methods

https://gerrit.wikimedia.org/r/511388

Change 511388 merged by jenkins-bot:
[pywikibot/core@master] [tests] Announce FutureWarning with weblib methods

https://gerrit.wikimedia.org/r/511388

Change 526522 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [cleanup] Remove weblib tests

https://gerrit.wikimedia.org/r/526522

Change 526522 merged by jenkins-bot:
[pywikibot/core@master] [cleanup] Remove weblib tests

https://gerrit.wikimedia.org/r/526522

Change 563665 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [cleanup] remove weblib.py in favour of mementoclient

https://gerrit.wikimedia.org/r/563665

Change 563665 merged by jenkins-bot:
[pywikibot/core@master] [cleanup] remove weblib.py in favour of mementoclient

https://gerrit.wikimedia.org/r/563665