On URL submission, look up the archived page in the Internet Archive's index and add to the return data
Open, Stalled, Medium, Public, 8 Estimated Story Points

Description

On parsing a URL (any URL), use the Internet Archive's API to generate the (?)archive_url and archive_date fields in the return.

Where the URL fails (4xx, 5xx), also set a use_archive flag (or similar) indicating that the original URL is not currently valid. (This is T95388: Try to find link in archive.org for 520 page.)
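As a rough sketch of what that could look like against the Wayback Machine availability API (assuming citoid already knows the status code it got for the original URL; the TypeScript names below merely stand in for the proposed fields):

```
// Hypothetical sketch of the lookup described above, against the Wayback
// Machine availability API. archiveUrl/archiveDate/useArchive stand in for
// the proposed archive_url/archive_date/use_archive fields; liveStatus is
// an assumed parameter carrying the status code already seen for the URL.

interface ArchiveFields {
    archiveUrl?: string;   // closest snapshot URL
    archiveDate?: string;  // 14-digit Wayback timestamp, YYYYMMDDhhmmss
    useArchive?: boolean;  // original URL returned 4xx/5xx
}

async function lookupArchive(url: string, liveStatus: number): Promise<ArchiveFields> {
    const fields: ArchiveFields = {};
    if (liveStatus >= 400) {
        fields.useArchive = true; // original URL is not currently valid
    }
    const res = await fetch('https://archive.org/wayback/available?url=' + encodeURIComponent(url));
    const data = await res.json();
    const closest = data.archived_snapshots?.closest;
    if (closest?.available) {
        fields.archiveUrl = closest.url;
        fields.archiveDate = closest.timestamp;
    }
    return fields;
}
```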

Event Timeline

Jdforrester-WMF raised the priority of this task to High.
Jdforrester-WMF updated the task description.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Discussed at WikiConference USA in DC with @Sadads, @Harej and others.

For citoid's purposes, I feel that the 'use-archive' flag would mostly be false positives. The vast majority of links don't expire between the time when they are viewed by the user and when they are added via citoid, and we relatively frequently can't scrape certain pages, for technical reasons, that are otherwise readable by a human using a browser.

I was mostly going from a comment that 5% of links in references were 404s at the time they were created…

Are there any plans to work on this in the near future? There is also a proposal to implement a similar idea as a Lua module (https://www.mediawiki.org/wiki/User:Legoktm/archive.txt). The Citoid implementation would probably be a better solution though (as discussed at T120850#1970673).

@Cyberpower678 do you see a route for https://tools.wmflabs.org/iabot to integrate here and do the heavy lifting?

Perhaps there's a way to make the new https://tools.wmflabs.org/iabot do the heavy lifting on this

To an extent, yes. https://tools.wmflabs.org/iabot/api.php can provide fast information on the URLs it knows of with archives that are confirmed working. It's being regularly maintained.

Documentation at https://meta.wikimedia.org/wiki/InternetArchiveBot/API

So in response to @czar: the tool has a fast API. You can look up multiple URLs in a single query and get a near-instantaneous response.

https://meta.wikimedia.org/wiki/InternetArchiveBot/API#action.3Dsearchurldata and https://meta.wikimedia.org/wiki/InternetArchiveBot/API#action.3Dsearchurlfrompage may prove useful.

The information returned contains the live state of the URL and an associated archive that should be fully functional in terms of delivering site content. It also tells you whether it has ever tried to check for an archive, and whether it thinks an archive is available or should be inquired about on the Wayback Machine. It's a great first step, considering that the API response is much faster than the Wayback Machine's. This also helps to lighten the load on the Wayback API, which tends to get overloaded every now and then.
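For a concrete sense of a batch lookup, a sketch follows; the searchurldata action name comes from the documentation linked above, but the urls parameter, its separator, and the response shape are assumptions to verify against that page:

```
// Hypothetical batch lookup against the iabot API. 'searchurldata' is the
// action named in the docs linked above; the 'urls' parameter name, the
// newline separator, and the response shape are assumptions to verify there.

async function searchUrlData(urls: string[]): Promise<unknown> {
    const params = new URLSearchParams({
        action: 'searchurldata',
        urls: urls.join('\n'), // multiple URLs in a single query
    });
    const res = await fetch('https://tools.wmflabs.org/iabot/api.php?' + params.toString());
    // Per the description above, the response carries each URL's live state
    // plus a confirmed-working archive URL where one is known.
    return res.json();
}
```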

Thanks @Cyberpower678 - I will check it out.

I am a little worried about using a Tool Labs tool for an in-production service, for uptime reasons. Theoretically the Wayback Machine should be more stable... however, you are absolutely right about speed.

I am running tests using the Wayback availability API right now. Most of our scraping tests have double the timeout because they do take longer (4 seconds). However, I doubled that, and then had to double it *again* to 16 seconds, because about half the time the tests were exceeding 8 seconds. Even if we parallelise this, it will likely cause the entire request to time out, and will increase response times for nearly every request made.

One option is just to give up if the request isn't done by the time everything else is finished. This is the approach we took with PubMed, where we had similar problems. (@mobrovac, did that ever get into production? It takes a settings change; I wasn't sure if that part was actually done.)

Or we could force an earlier timeout to guarantee we aren't sinking entire requests because of this. Or some combination: if everything else finishes early, let the request wait for Wayback, but not for too long.
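A minimal sketch of that combination, assuming a Promise-based lookup; the function names and the 1500 ms grace period are illustrative, not citoid's actual code:

```
// Start the Wayback lookup in parallel with scraping, then allow it only a
// short grace period once the rest of the citation is ready.

declare function scrapeCitation(url: string): Promise<Record<string, unknown>>; // stand-in for the existing scrape step
declare function lookupArchive(url: string): Promise<Record<string, unknown> | undefined>; // stand-in for the Wayback call

function withGracePeriod<T>(p: Promise<T>, ms: number): Promise<T | undefined> {
    const timeout = new Promise<undefined>((resolve) => setTimeout(() => resolve(undefined), ms));
    return Promise.race([p, timeout]);
}

async function buildCitation(url: string) {
    const archivePromise = lookupArchive(url);   // kick off early, in parallel
    const citation = await scrapeCitation(url);  // the rest of the request
    // Everything else is done: wait a little longer for Wayback, but not too long.
    const archive = await withGracePeriod(archivePromise, 1500);
    return archive ? { ...citation, ...archive } : citation;
}
```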

@mobrovac could you perhaps weigh in on this? What are your thoughts about using the tool lab tool vs. wayback availability API? Any thoughts about what to do about performance?

(T162886 was where we dealt with pubmed performance issues)

Mvolz changed the task status from Open to Stalled.Dec 7 2017, 11:01 AM
Mvolz lowered the priority of this task from High to Medium.Mar 26 2020, 10:07 AM

This will be wonderful... I hope we will bring it into the ref toolbar: https://en.wikipedia.org/wiki/Wikipedia:RefToolbar/2.0

Just an FYI, but I investigated adding this with a WIP patch here (a few years ago): https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/375810

At the time, it made pretty much every request time out. Response time was just too slow.

Even if archive.org did its work faster, if the page being archived is itself served slowly, then it takes archive.org longer, and by the time the link gets to us it's been too long. We have that problem too, which is why we try to return citations from databases (e.g. Crossref) whenever possible, and scrape the actual page on the server only as a last resort. Archive.org doesn't have that option, because they need the whole page, not just the metadata.

Even if there have been changes to the API that let us predict what the URL will be before archiving actually finishes, I don't see us pursuing this as something we want to do in real time. Even if we knew what the URL would be, it's a problem if we add archive links to things that end up not being archivable at all, and I don't really see a way around that, other than running a separate bot that removes links to archive.org pages that don't exist because the resource wasn't archivable! Either way it involves bots, and I think users would be mad if we added non-viable links even if we had bots clean them up :/. So I don't think that route is viable.

In the end, I think this is better left to bots, which can take their time.

Declined because I think adding archive links is better left to bots.

I would like to emphasize that the availability API has seen improvements since 2017. Is it possible you can redo your investigation a bit?

The problem is that if it's already archived, how can we be sure the cited version matches the archived version? I think for newly created citations we need to archive at the time of citation :(. But if people are happy with the archive link just being close to, rather than from, the time of citation, maybe that would be okay. I'd just be worried about people citing something and then the archive link pointing to something archived 3 years ago that was totally different!

But this is a good point. For links that don't have a live version at all (i.e. 404s), we could investigate and see if we get a good response from the API; that's probably better than doing nothing.
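A minimal sketch of that 404 fallback, using the availability API's documented timestamp parameter to ask for the snapshot closest to the citation's access date; the closeness check and its one-year cutoff are illustrative assumptions:

```
// For a dead link, request the snapshot closest to the access date and only
// accept it if it is reasonably near that date.

async function archiveForDeadLink(url: string, accessDate: Date): Promise<string | undefined> {
    // Wayback timestamps are YYYYMMDDhhmmss; YYYYMMDD is enough here.
    const ts = accessDate.toISOString().slice(0, 10).replace(/-/g, '');
    const res = await fetch(
        'https://archive.org/wayback/available?url=' + encodeURIComponent(url) + '&timestamp=' + ts
    );
    const data = await res.json();
    const closest = data.archived_snapshots?.closest;
    if (!closest?.available) {
        return undefined; // nothing archived: better to add no link than a broken one
    }
    // Reject snapshots more than a year from the access date (illustrative cutoff).
    const snapYear = Number(closest.timestamp.slice(0, 4));
    if (Math.abs(snapYear - accessDate.getUTCFullYear()) > 1) {
        return undefined;
    }
    return closest.url;
}
```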

Change 375810 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] [WIP] Investigate wayback availability api

https://gerrit.wikimedia.org/r/375810

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

Mvolz changed the task status from Open to Stalled.Jan 10 2024, 12:27 PM