On URL submission, look up the archived page in the Internet Archive's index and add to the return data
Status: Stalled · Priority: High · Public · 8 Story Points

Description

On parsing a URL (any URL), use the Internet Archive's API to populate (provisionally named) archive_url and archive_date fields in the returned data.

Where the URL fails (4xx, 5xx), also set a use_archive flag (or similar) indicating that the original URL is not currently valid. (This is T95388: Try to find link in archive.org for 520 page.)
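A minimal sketch of the intended behaviour, assuming the public Wayback Machine availability endpoint (https://archive.org/wayback/available); the archive_url / archive_date / use_archive names follow the description above, but the shape of the enriched return object is an assumption:

```typescript
// Minimal sketch, assuming the public Wayback Machine availability endpoint.
// The archive_url / archive_date / use_archive field names follow the task
// description; the enriched return shape is an assumption.

interface WaybackAvailability {
  archived_snapshots?: {
    closest?: {
      available: boolean;
      url: string;
      timestamp: string; // YYYYMMDDhhmmss
      status: string;
    };
  };
}

async function addArchiveFields(
  url: string,
  originalStatus: number, // HTTP status seen when fetching the original URL
): Promise<{ archive_url?: string; archive_date?: string; use_archive?: boolean }> {
  const res = await fetch(
    `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`,
  );
  const data = (await res.json()) as WaybackAvailability;
  const closest = data.archived_snapshots?.closest;
  if (!closest?.available) {
    return {}; // no snapshot known; leave the citation unchanged
  }
  return {
    archive_url: closest.url,
    archive_date: closest.timestamp,
    // Prefer the archive when the original URL is failing (4xx/5xx).
    use_archive: originalStatus >= 400,
  };
}
```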

Jdforrester-WMF updated the task description.
Jdforrester-WMF raised the priority of this task to High.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.
Restricted Application added a subscriber: Aklapper. · Oct 11 2015, 7:02 PM

Discussed at WikiConference USA in DC with @Sadads, @Harej and others.

Sadads set Security to None.
Mvolz added a subscriber: Mvolz. · Oct 11 2015, 8:11 PM

For citoid's purposes, I feel that the 'use-archive' flag would mostly produce false positives. The vast majority of links don't expire between the time they are viewed by the user and the time they are added via citoid, and we relatively frequently can't scrape certain pages, for technical reasons, that are otherwise readable by a human using a browser.

> For citoid's purposes, I feel that the 'use-archive' flag would mostly produce false positives. The vast majority of links don't expire between the time they are viewed by the user and the time they are added via citoid, and we relatively frequently can't scrape certain pages, for technical reasons, that are otherwise readable by a human using a browser.

I was mostly going from a comment that 5% of links in references were 404s at the time they were created…

jayvdb added a subscriber: jayvdb. · Oct 11 2015, 10:42 PM
Jdforrester-WMF moved this task from To Triage to TR0: Interrupt on the VisualEditor board.
Mvolz moved this task from Backlog to IO Tasks on the Citoid board. · Jan 12 2016, 10:15 AM
Sadads added a subscriber: kaldari. · Jan 12 2016, 5:20 PM

@kaldari needs to be on this as well.

DannyH added a subscriber: DannyH. · Jan 29 2016, 12:26 AM

Are there any plans to work on this in the near future? There is also a proposal to implement a similar idea as a Lua module (https://www.mediawiki.org/wiki/User:Legoktm/archive.txt). The Citoid implementation would probably be a better solution though (as discussed at T120850#1970673).

Qgil removed a subscriber: Qgil. · Oct 31 2016, 12:24 PM
Mvolz claimed this task. · May 12 2017, 4:37 PM
czar added a comment (edited). · Jul 12 2017, 2:46 PM

@Cyberpower678 do you see a route for https://tools.wmflabs.org/iabot to integrate here and do the heavy lifting?

To an extent, yes. https://tools.wmflabs.org/iabot/api.php can provide fast information on the URLs it knows of with archives that are confirmed working. It's being regularly maintained.

Documentation at https://meta.wikimedia.org/wiki/InternetArchiveBot/API

So, in response to @czar: the tool has a fast API. You can look up multiple URLs in a single query and get a near-instantaneous response.

https://meta.wikimedia.org/wiki/InternetArchiveBot/API#action.3Dsearchurldata and https://meta.wikimedia.org/wiki/InternetArchiveBot/API#action.3Dsearchurlfrompage

may prove useful.

The information returned contains the live state of the URL and an associated archive that should be fully functional in terms of delivering site content. It also tells you whether the tool has ever checked for an archive, and whether it thinks an archive is available or should be requested from the Wayback Machine. It's a great first step, considering that this API responds much faster than the Wayback Machine does. This also helps to lighten the load on the Wayback API, which tends to get overloaded every now and then.
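A hedged sketch of what a batched lookup against this API might look like; the action name comes from the documentation linked above, but the `urls` parameter name, its separator, and the response handling are assumptions that should be verified against that documentation:

```typescript
// Hedged sketch of a batched lookup. The action name (searchurldata) comes
// from the documentation linked above; the `urls` parameter name, the
// newline separator, and the response shape are assumptions that should be
// checked against https://meta.wikimedia.org/wiki/InternetArchiveBot/API.

async function lookupViaIabot(urls: string[]): Promise<unknown> {
  const params = new URLSearchParams({
    action: 'searchurldata', // action name per the linked API docs
    format: 'json',          // assumed: JSON output selector
    urls: urls.join('\n'),   // assumed: multiple URLs in one query
  });
  const res = await fetch(`https://tools.wmflabs.org/iabot/api.php?${params}`);
  if (!res.ok) {
    throw new Error(`IABot API returned HTTP ${res.status}`);
  }
  // Expected (per the docs): live state per URL plus a confirmed-working
  // archive URL where one is known.
  return res.json();
}
```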

Mvolz added a comment (edited). · Thu, Dec 7, 10:54 AM

Thanks @Cyberpower678 - I will check it out.

I am a little worried about using a Tool Labs tool for an in-production service, for uptime reasons. Theoretically, the Wayback Machine should be more stable; however, you are absolutely right about speed.

I am running tests using the Wayback availability API right now. Most of our scraping tests already have double the usual timeout (4 seconds) because they take longer. However, I doubled that, and then had to double it *again* to 16 seconds, because around half the time the tests were exceeding 8 seconds. Even if we parallelise this, it will likely cause the entire request to time out, and it will increase response times for nearly every request made.

One option is to just give up if the request isn't done by the time everything else is finished. This is the approach we took with PubMed, where we had similar problems. (@mobrovac, did that ever get into production? It requires a settings change; I wasn't sure if that part was actually done.)

Or we could force an earlier timeout to guarantee we aren't sinking entire requests because of this. Or some combination: if everything else finishes early, let the request wait for Wayback, but not for too long (sketched below).
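A minimal sketch of that combination, racing the archive lookup against a soft deadline; the 2-second budget is an illustrative number, not a tested value:

```typescript
// Sketch of the "don't sink the whole request" idea: race the archive
// lookup against a soft deadline and fall back to empty archive data.
// The 2000 ms budget below is illustrative, not a tested value.

function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  const deadline = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), ms),
  );
  return Promise.race([task, deadline]);
}

// Hypothetical usage: start the archive lookup alongside scraping; once
// scraping is done, give the lookup only a short grace period.
async function enrichCitation(
  scrape: Promise<Record<string, unknown>>,
  archiveLookup: Promise<Record<string, unknown>>,
) {
  const citation = await scrape;
  const archive = await withTimeout(archiveLookup, 2000, {});
  return { ...citation, ...archive };
}
```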

@mobrovac could you perhaps weigh in on this? What are your thoughts about using the tool lab tool vs. wayback availability API? Any thoughts about what to do about performance?

Mvolz added a comment. · Thu, Dec 7, 10:59 AM

(T162886 is where we dealt with the PubMed performance issues.)

Mvolz changed the task status from Open to Stalled. · Thu, Dec 7, 11:01 AM