Add support for other common archiving services
Closed, Resolved (Public)

Description

archive.org isn't the only archive out there being used, though it is the most commonly used one.

While Cyberbot doesn't necessarily need to query these services for archives, it should at least be able to recognize links from sites such as archive.is, WebCite, Memento, etc.

Event Timeline

I have developed a function that can now resolve webcite URLs. It's slow unfortunately, because WebCite is agonizingly slow. But it does bring the bot one step closer to recognizing and handling other archives.

So far I haven't been able to figure out how to resolve archive.is links.

If anyone knows of an API that may exist there, please let me know.

WebCite is broken now. :/

@Green_Cardamom This is why Wayback is the most widely used. :p

Since the errors it's spitting out are consistently formatted and seem to be related, I just used a regex to filter them out, leaving the functional XML intact.

I recommend Memento's API for resolving links to WebCite, archive.is, and others.

Wayback is a big crawler, so it has a huge database. Most of the rest are archive-on-demand or run only small crawls. But Wayback honors robots.txt, so links disappear over time. One idea is to automatically send any on-wiki archive.org links to WebCite or somewhere else as a backup in case of robots.txt removals, but that needs community discussion.

What is meant by "resolve" - given an existing archive.is link, determine the original link?

Probably unrelated, but WebCite and archive.is support a long-form URL (which should be the default, IMO):

http://www.webcitation.org/query?url=http%3A%2F%2Fgoogle.com&date=2016-05-30
https://archive.is/20160527055310/https://www.google.com/

Exactly. It's supposed to improve the ability to fix archive URLs that are present in the URL parameter. As an added bonus, for WebCite, it normalizes the URL format and converts it to the short form, as those look nicer.

https://en.wikipedia.org/wiki/Typhoon%20Bolaven%20(2012)?diff=prev&oldid=728319994 and https://en.wikipedia.org/wiki/Light%20Me%20Up%20Tour?diff=prev&oldid=728409601 are really good examples.
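For the long-form URLs quoted earlier in this thread, the original URL is embedded in the archive URL itself, so no API call or scraping is needed to recover it. The following is a minimal Python sketch of that idea only, not the bot's actual code. The names `ArchiveLink` and `parse_long_form` are hypothetical, and the patterns are assumed from the two example URLs in this thread rather than from any official specification.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit


class ArchiveLink(NamedTuple):
    service: str
    original_url: str
    snapshot: str  # date or timestamp exactly as it appears in the archive URL


# Assumed archive.is long form: /<14-digit timestamp>/<original URL>
ARCHIVE_IS_LONG = re.compile(r"^https?://archive\.is/(\d{14})/(.+)$")


def parse_long_form(archive_url: str) -> Optional[ArchiveLink]:
    """Return the embedded original URL, or None for short-form or unknown links."""
    match = ARCHIVE_IS_LONG.match(archive_url)
    if match:
        return ArchiveLink("archive.is", match.group(2), match.group(1))

    parts = urlsplit(archive_url)
    if parts.netloc.lower().endswith("webcitation.org") and parts.path == "/query":
        query = parse_qs(parts.query)  # parse_qs also percent-decodes the values
        if "url" in query:
            date = query.get("date", [""])[0]
            return ArchiveLink("webcite", query["url"][0], date)
    return None
```

Short-form links carry no embedded original URL, so this sketch returns `None` for them; those are the cases that still need an API call or a page fetch.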

I have figured out how to resolve archive.is links without the need for an API there, but it is yet to be implemented and tested. I will get to that later today.

Looks good.

Re: plain-language URL vs. shortened URL. The former is probably preferred if given the option:

https://en.wikipedia.org/wiki/Wikipedia:Using_archive.is#Use_within_Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:Using_WebCite#Use_within_Wikipedia

The first says the long format is preferred, the second says either is OK :)

There's a general name for this in computer science that I forget (there's a Wikipedia article about it), but plain-language URLs give end users more information and more ability to do things with them. URL shortening services such as bit.ly are blacklisted, and this is one rationale some users have given for blacklisting archive.is. Of course, the original URL is embedded in the template, so there is that, but there are bare links too. It's probably something that should go to an RfC to get consensus, since opinions seem to vary. Personally, I think it's safe to go with the long form, since it mirrors archive.org, which has drawn no complaints. If you want, I have no problem starting an RfC, not specific to Cyberbot but about the preferred URL format for WebCite and archive.is.

Forgot to add: URL shortening services are blacklisted because they give spammers a way to hide spam.

You are free to start an RfC. Cyberbot uses the short URL because that is what WebCite's API returns as the archive URL.

How did you resolve WebCite and archive.is to long form? Is it an API call, a web scrape, or something else? Is it something a user could do easily without coding?

What do you mean by long form? I'm resolving WebCite URLs to their original URLs by using their API. As for archive.is, it is done by web scraping, since they have no API.
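As a rough illustration of the WebCite side, here is a minimal Python sketch of pulling the original URL out of an XML response. The thread confirms only that the API returns XML; the element names (`resultset`, `result`, `original_url`), the `status` attribute, and the sample document below are assumptions made for illustration, and `original_url_from_xml` is a hypothetical helper, not the bot's function.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical response body; the element names are assumed, not confirmed here.
SAMPLE = (
    "<resultset>"
    '<result status="success">'
    "<original_url>http://google.com</original_url>"
    "</result>"
    "</resultset>"
)


def original_url_from_xml(xml_text: str) -> Optional[str]:
    """Return the original URL from the first successful result, if any."""
    root = ET.fromstring(xml_text)
    for result in root.iter("result"):
        if result.get("status") == "success":
            url = result.findtext("original_url")
            if url:
                return url.strip()
    return None
```

Calling `original_url_from_xml(SAMPLE)` on the assumed sample yields the embedded URL; a response with no successful result yields `None`, which a caller could treat as "leave the link untouched".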

Right, original URL. Alright that's helpful to know.

It will probably become a little more advanced at handling other archives at some point, but for now it only sticks to recognizing the other major archiving services and using other templates, not just Wayback.

My Archive.is resolver now works correctly. :D

Closing as resolved. IABot can now recognize the four major archiving services.

Opened an RfC.

https://en.wikipedia.org/wiki/Wikipedia_talk:Using_archive.is#RfC:_Should_we_use_short_or_long_format_URLs.3F

Good idea to get this documented early, before anyone starts complaining. It will likely close as "either" (or no consensus for short or long).