On URL submission, look up the archived page in the Internet Archive's index and add to the return data
Open, Medium · Public · 8 Estimated Story Points

Description

On parsing a URL (any URL), use the Internet Archive's API to generate the (?)archive_url and archive_date fields in the return.

Where the URL fails (4xx, 5xx), also set a use_archive flag (or similar) indicating that the original URL is not currently valid. (This is T95388: Try to find link in archive.org when direct scraping fails.)
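For illustration, a minimal sketch of that lookup against the public Wayback availability API (https://archive.org/wayback/available). The add_archive_fields helper and the exact field handling are an assumption of how this could look, not citoid's actual implementation:

```
# Minimal sketch of the behaviour described above, using the public Wayback
# availability API. The field names archive_url / archive_date / use_archive
# follow the task description; the helper itself is hypothetical.
import requests

WAYBACK_AVAILABILITY_API = 'https://archive.org/wayback/available'

def add_archive_fields(url: str, citation: dict, timeout: float = 4.0) -> dict:
    """Look up `url` in the Wayback Machine and add archive fields to `citation`."""
    # Flag the original URL as dead if it currently returns a 4xx/5xx.
    try:
        live = requests.head(url, allow_redirects=True, timeout=timeout)
        citation['use_archive'] = live.status_code >= 400
    except requests.RequestException:
        citation['use_archive'] = True

    # Ask the availability API for the closest archived snapshot.
    resp = requests.get(WAYBACK_AVAILABILITY_API, params={'url': url}, timeout=timeout)
    closest = resp.json().get('archived_snapshots', {}).get('closest') or {}
    if closest.get('available'):
        citation['archive_url'] = closest['url']
        # Wayback timestamps look like YYYYMMDDhhmmss.
        citation['archive_date'] = closest['timestamp'][:8]
    return citation
```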

Event Timeline

Jdforrester-WMF raised the priority of this task from to High.
Jdforrester-WMF updated the task description. (Show Details)
Jdforrester-WMF subscribed.

Discussed at WikiConference USA in DC with @Sadads, @Harej and others.

For citoid's purposes, I feel that the 'use-archive' flag would mostly produce false positives. The vast majority of links don't expire between the time they are viewed by the user and the time they are added via citoid, and we relatively frequently can't scrape certain pages for technical reasons even though they are otherwise readable by a human using a browser.

I was mostly going from a comment that 5% of links in references were 404s at the time they were created…

Are there any plans to work on this in the near future? There is also a proposal to implement a similar idea as a Lua module (https://www.mediawiki.org/wiki/User:Legoktm/archive.txt). The Citoid implementation would probably be a better solution though (as discussed at T120850#1970673).

@Cyberpower678 do you see a route for https://tools.wmflabs.org/iabot to integrate here and do the heavy lifting?

Perhaps there's a way to make the new https://tools.wmflabs.org/iabot do the heavy lifting on this.

To an extent, yes. https://tools.wmflabs.org/iabot/api.php can provide fast information on the URLs it knows of with archives that are confirmed working. It's being regularly maintained.

Documentation at https://meta.wikimedia.org/wiki/InternetArchiveBot/API

So in response, @czar: the tool has a fast API. You can look up multiple URLs in a single query and get a near-instantaneous response.

https://meta.wikimedia.org/wiki/InternetArchiveBot/API#action.3Dsearchurldata and https://meta.wikimedia.org/wiki/InternetArchiveBot/API#action.3Dsearchurlfrompage may prove useful.

The information returned contains the live state of the URL and an associated archive that should be fully functional in terms of delivering site content. It also tells you whether it has ever tried to check for an archive, as well as whether it thinks an archive is available or should be inquired about on the Wayback Machine. It's a great first step considering that the API response is much faster than the Wayback Machine's. This also helps to lighten the load on the Wayback API, which tends to get overloaded every now and then.
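For illustration, a hedged sketch of a batch query against that API. The action=searchurldata value comes from the documentation linked above, but the `urls` parameter name, the `format` parameter, and the response layout are assumptions that would need to be checked against https://meta.wikimedia.org/wiki/InternetArchiveBot/API:

```
# Hedged sketch of querying the InternetArchiveBot API mentioned above.
# Parameter names other than `action` are assumptions, not verified.
import requests

IABOT_API = 'https://tools.wmflabs.org/iabot/api.php'

def lookup_iabot(urls: list[str], timeout: float = 4.0) -> dict:
    """Batch-look up URLs in IABot's database of live states and known archives."""
    resp = requests.get(
        IABOT_API,
        params={
            'action': 'searchurldata',
            'format': 'json',          # assumed; the API may default to JSON
            'urls': '\n'.join(urls),   # assumed separator for multiple URLs
        },
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()
```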

Thanks @Cyberpower678 - I will check it out.

I am a little worried about using a Tool Labs tool for an in-production service, for uptime reasons. Theoretically the Wayback Machine should be more stable... however, you are absolutely right about speed.

I am running tests using the Wayback availability API right now. Most of our scraping tests already have double the usual timeout (4 seconds) because they take longer. However, I doubled that, and then had to double it *again* to 16 seconds, because about half the time the tests were exceeding 8 seconds. Even if we parallelise this, it will likely cause the entire request to time out, and will increase response times for nearly every request made.

One option is to just give up if the request isn't done by the time everything else is finished. This is the approach we took with pubmed, where we had similar problems. (@mobrovac, did that ever get into production? It requires a setting change, and I wasn't sure if that part was actually done.)

Or we could force an earlier timeout to guarantee we aren't sinking entire requests because of this. Or some combination: if everything else finishes early, let it wait for wayback, but not for too long.
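Something like the following, loosely following the pubmed approach (the coroutine names here are hypothetical stand-ins, not citoid code): fire the Wayback lookup in parallel, and if it isn't done shortly after the scrape finishes, drop it rather than hold up the whole response.

```
# Sketch of the "give up if it isn't done in time" idea; names are hypothetical.
import asyncio

async def lookup_wayback(url: str) -> dict:
    """Stand-in for the (slow) Wayback availability lookup."""
    await asyncio.sleep(8)
    return {'archive_url': '...'}

async def scrape_citation(url: str) -> dict:
    """Stand-in for citoid's normal scraping pipeline."""
    await asyncio.sleep(2)
    return {'url': url}

async def scrape_with_optional_archive(url: str, grace: float = 1.0) -> dict:
    # Start the Wayback lookup in parallel with the normal scrape.
    archive_task = asyncio.create_task(lookup_wayback(url))
    citation = await scrape_citation(url)
    try:
        # Give the archive lookup a short grace period once the scrape is done,
        # then give up rather than delaying the entire request.
        citation.update(await asyncio.wait_for(archive_task, timeout=grace))
    except asyncio.TimeoutError:
        pass  # wait_for cancels the lookup task on timeout
    return citation
```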

@mobrovac could you perhaps weigh in on this? What are your thoughts about using the Tool Labs tool vs. the Wayback availability API? Any thoughts on what to do about performance?

(T162886 was where we dealt with pubmed performance issues)

Mvolz changed the task status from Open to Stalled. Dec 7 2017, 11:01 AM
Mvolz lowered the priority of this task from High to Medium. Mar 26 2020, 10:07 AM

This will be wonderful... We will bring it into the ref toolbar, I hope: https://en.wikipedia.org/wiki/Wikipedia:RefToolbar/2.0

Just an FYI, but I investigated adding this with a WIP patch here (a few years ago): https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/375810

At the time, it made pretty much every request time out. Response time was just too slow.

Even if archive.org did its work faster, if the page being archived is itself served slowly, then it takes archive.org longer, and by the time the link gets to us it's been too long. We have that problem too, which is why we try to return citations from databases (i.e. crossref) whenever possible, and scrape the actual page on the server as a last resort. Archive.org doesn't have that option because they need the whole page, not just the metadata. 

Even if there have been changes to the API that allow us to predict what the URL will be before it actually finishes archiving, I don't see us pursuing this as something we want to do in real time. Even if we knew what the URL would be, it's a problem if we add archive links to things that end up not being archivable at all, and I don't really see a way around that, other than running a separate bot that removes links to archive.org pages that don't exist because the resource wasn't archivable. Either way it involves bots, and I think users would be mad if we added non-viable links, even if we had bots clean them up :/. So I don't think that route is viable.

In the end, I think this is better left to bots, which can take their time.

Declined because I think the adding of archive links is better left to bots.

I would like to emphasize that the availability API has seen improvements since 2017. Is it possible you can redo your investigation a bit?

The problem is that if it's already archived, how can we be sure the cited version matches the archived version? I think for newly created citations we need to archive it at time of citation :(. But if people are happy with the archived link just being close, rather than at the time of citation, maybe that'd be okay. I'd just be worried about people citing something and then the archive link pointing to something archived 3 years ago that was totally different!

But this is a good point. For links that don't have a live version at all (i.e. 404s), we could investigate and see if we get a good response from the API; that's probably better than doing nothing.
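A sketch of that narrower case: only hit the Wayback availability API once the live URL has already failed, and ask for the snapshot closest to the current time so the archived copy is as close as possible to what the editor saw. The archive_fallback helper is hypothetical:

```
# Only fall back to the Wayback Machine when the original URL is dead.
from datetime import datetime, timezone
import requests

def archive_fallback(url: str, status_code: int, timeout: float = 4.0) -> dict | None:
    """Return archive fields only when the original URL returned a 4xx/5xx."""
    if status_code < 400:
        return None
    # Ask for the snapshot nearest to the moment the citation is being created.
    now = datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')
    resp = requests.get(
        'https://archive.org/wayback/available',
        params={'url': url, 'timestamp': now},
        timeout=timeout,
    )
    closest = resp.json().get('archived_snapshots', {}).get('closest') or {}
    if closest.get('available'):
        return {'archive_url': closest['url'], 'archive_date': closest['timestamp'][:8]}
    return None
```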

Change 375810 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] [WIP] Investigate wayback availability api

https://gerrit.wikimedia.org/r/375810

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

Mvolz changed the task status from Open to Stalled. Jan 10 2024, 12:27 PM

I would like to emphasize that the availability API has seen improvements since 2017. Is it possible you can redo your investigation a bit?

We tried this again for a limited case: https://phabricator.wikimedia.org/T95388 but that change could easily be expanded to this use case.

It's now really fast (great); unfortunately, it also seems to be really unstable. I.e. it often returns a completely empty object for the nearest snapshot, even when one is in fact available. :/
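If we pursue this further, a simple (hypothetical) mitigation would be to treat an empty `closest` object as transient and retry the lookup once before concluding that no snapshot exists:

```
# Retry the availability lookup when it returns an empty `closest` object.
import requests

def closest_snapshot(url: str, attempts: int = 2, timeout: float = 4.0) -> dict | None:
    for _ in range(attempts):
        resp = requests.get('https://archive.org/wayback/available',
                            params={'url': url}, timeout=timeout)
        closest = resp.json().get('archived_snapshots', {}).get('closest')
        if closest:  # sometimes empty even when a snapshot exists
            return closest
    return None
```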

Change #375810 merged by jenkins-bot:

[mediawiki/services/citoid@master] Scrape archive.org if page isn't available

https://gerrit.wikimedia.org/r/375810

Mvolz changed the task status from Stalled to Open. Thu, Dec 5, 12:01 PM
Mvolz claimed this task.