Have IABot use long form URLs for archive.is and webcite
Closed, Resolved, Public

Assigned To

Authored By

	Cyberpower678
	Aug 17 2016, 3:12 PM

Description

Per https://en.wikipedia.org/wiki/Wikipedia_talk:Using_archive.is#RfC:_Should_we_use_short_or_long_format_URLs.3F, archive URLs should be in long form.

Related Objects

Status	Assigned	Task
Resolved	Cyberpower678	T120433 Migrate dead external links to archives
Resolved	Cyberpower678	T136128 Add support for other common archiving services
Resolved	Cyberpower678	T141347 Create and test v1.2 of InternetArchiveBot (tracking)
Resolved	Cyberpower678	T143214 Have IABot use long form URLs for archive.is and webcite

Event Timeline

Cyberpower678 created this task. Aug 17 2016, 3:12 PM

Cyberpower678 added a parent task: T141347: Create and test v1.2 of InternetArchiveBot (tracking).

Thank you for adding this.

The closer gave this example long form:

http://archive.is/YYYY.MM.DD-hhmmss/http://www.example.com

But there was no discussion about it in the RfC.

Personally I think this is better

http://archive.is/YYYYMMDDhhmmss/http://www.example.com

The 14-digit timestamp is the same format that Internet Archive and WebCite use, which makes tool building easier. It is also easier for end users, more logical, and less cluttered on pages.
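
A minimal sketch of the normalization argued for above, assuming an archive.is long-form URL carries either the dotted `YYYY.MM.DD-hhmmss` timestamp or the plain 14-digit timestamp as its first path segment. The function name and the regular expression are illustrative only and are not taken from IABot.

```python
import re

# Matches the two long forms discussed in this thread (an assumption,
# not an exhaustive list of what archive.is accepts):
#   http://archive.is/YYYY.MM.DD-hhmmss/<original URL>
#   http://archive.is/YYYYMMDDhhmmss/<original URL>
_LONG_FORM = re.compile(
    r"^(https?://archive\.is/)"
    r"(\d{4})\.?(\d{2})\.?(\d{2})-?(\d{6})"
    r"/(.+)$"
)


def to_14_digit_form(archive_url):
    """Return the URL with a 14-digit timestamp, or None if it is not long form."""
    match = _LONG_FORM.match(archive_url)
    if match is None:
        # Short-form URLs carry no timestamp, so there is nothing to normalize.
        return None
    prefix, year, month, day, hms, original = match.groups()
    return f"{prefix}{year}{month}{day}{hms}/{original}"
```

A short-form URL returns `None`, which a caller could treat as "needs a lookup before it can be converted".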

For WebCite, the German wiki doesn't use dates in the URL, since the date is encoded in the ID. That is a good idea: it reduces errors in case the ID date doesn't match the &date parameter, and it removes a redundant piece of information that can cause failures. However, the date provides transparency. I'm not sure which is the better choice here, but I'm leaning towards not using &date.

So here are some problems that need to be addressed. Every archive link in the DB is short form. There are no checks done on URLs from the DB, since the checks are usually done before adding to the DB.

One idea: if you can make a list of all the short form archive URLs (webcite and archive.is) in the CB DB, I can write a program to create the corresponding long form versions. From that table, create an SQL query for the CB DB to modify the entries to long form, and then another fairly simple program to change the existing links on Wikipedia from short form to long form (with bot approval). That will reset everything back to zero, and going forward, when IABot sees a new short form it will change it to long form in the DB and on Wikipedia (I think?)

Cyberpower678 added a project: InternetArchiveBot. Aug 19 2016, 3:51 PM

Cyberpower678 moved this task from Inbox to Feature requests on the InternetArchiveBot board.

Cyberpower678 moved this task from Feature requests to New feature on the InternetArchiveBot board. Aug 19 2016, 11:03 PM

Cyberpower678 moved this task from New feature to v1.2 on the InternetArchiveBot board. Aug 21 2016, 2:12 AM

Cyberpower678 edited projects, added InternetArchiveBot (v1.2); removed InternetArchiveBot.

Cyberpower678 moved this task from Unsorted to New feature on the InternetArchiveBot (v1.2) board.

In T143214#2564614, @Green_Cardamom wrote:

One idea: if you can make a list of all the short form archive URLs (webcite and archive.is) in the CB DB, I can write a program to create the corresponding long form versions. From that table, create an SQL query for the CB DB to modify the entries to long form, and then another fairly simple program to change the existing links on Wikipedia from short form to long form (with bot approval). That will reset everything back to zero, and going forward, when IABot sees a new short form it will change it to long form in the DB and on Wikipedia (I think?)

I think it would be easier if the script ran directly against the DB. I'm going to develop a cleanup script to do a bunch of cleanup on the DB. It's going to verify that every archive URL is working, properly formatted, and using HTTPS where applicable.

@Trappist_the_monk I'm pinging you as the maintainer of the CS templates and modules. Per the RfC it would probably be wise to implement a function to catch short form and throw an error. Essentially if the template can't extract the date and original URL from the archive URL it should throw an error, unless it is an unknown archive.

I'm building the bot to convert all the short form to long form URLs.

I've got my base62 decoder working. :D

Maybe consider a general purpose command line tool that converts a single URL from short to long that anyone can use for any purpose with its own Github page. I could see the utility for reasons other than Wikipedia. The bot then executes that tool.

It's going to verify that every archive URL is working, properly formatted, and using HTTPS where applicable.

This is what WaybackMedic 2 is doing on the complete corpus of Wayback links on en.Wikipedia. I had planned on sending you the SQL as before, if you want it. If you're interested in a command line version of WM2 that can be called from your script, let me know. Verifying archive URLs is not trivial: there are many issues, and the code is done and highly tested.

In T143214#2571513, @Green_Cardamom wrote:

It's going to verify that every archive URL is working, properly formatted, and using HTTPS where applicable.

This is what WaybackMedic 2 is doing on the complete corpus of Wayback links on en.Wikipedia. I had planned on sending you the SQL as before, if you want it. If you're interested in a command line version of WM2 that can be called from your script, let me know. Verifying archive URLs is not trivial: there are many issues, and the code is done and highly tested.

SQL batches are fine. I'm doing basic cleanup.

In T143214#2571480, @Green_Cardamom wrote:

Maybe consider a general purpose command line tool that converts a single URL from short to long that anyone can use for any purpose with its own Github page. I could see the utility for reasons other than Wikipedia. The bot then executes that tool.

This is just a one time throw away cleanup script. I wasn't planning on making a toolkit out of it.

In T143214#2571419, @Cyberpower678 wrote:

Essentially if the template can't extract the date and original URL from the archive URL it should throw an error, unless it is an unknown archive.

Have you-all settled on what constitutes a 'long' format url? The RFC seems to indicate that for archive.is, one form has a timestamp with dot separators and one form has a timestamp without separators; for webcite one long-form appears to be the same as the short-form with a url query string tacked on and the other is a query string with url and date keywords.

Which two of these four forms are the correct forms? Or, is it necessary for cs1|2 to support all of them?

In T143214#2572673, @Trappist_the_monk wrote:

In T143214#2571419, @Cyberpower678 wrote:

Essentially if the template can't extract the date and original URL from the archive URL it should throw an error, unless it is an unknown archive.

Have you-all settled on what constitutes a 'long' format url? The RFC seems to indicate that for archive.is, one form has a timestamp with dot separators and one form has a timestamp without separators; for webcite one long-form appears to be the same as the short-form with a url query string tacked on and the other is a query string with url and date keywords.

Which two of these four forms are the correct forms? Or, is it necessary for cs1|2 to support all of them?

My idea is that any form that allows for independent extraction of the original URL and the snapshot timestamp is acceptable, and I was going to support all forms that have both of those items in them. As mentioned, the 9-character ID in a WebCite URL is a base 62 number that contains the Unix epoch timestamp in microseconds.
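
The decoding described above can be sketched as follows. This is a rough illustration, not the bot's decoder: the digit order of the alphabet (0-9, then A-Z, then a-z) is an assumption that should be checked against known WebCite snapshots, and the function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: digit order is 0-9, A-Z, a-z. Verify against real WebCite IDs
# before relying on the decoded timestamps.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def base62_decode(text):
    """Decode a base 62 string into an integer."""
    value = 0
    for ch in text:
        # str.index raises ValueError for characters outside the alphabet.
        value = value * 62 + ALPHABET.index(ch)
    return value


def snapshot_time(webcite_id):
    """Interpret a WebCite ID as microseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(microseconds=base62_decode(webcite_id))
```

Because the snapshot time can be recovered from the ID alone, a URL that carries the ID plus the original URL already contains both pieces of information the thread asks for.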

@Green_Cardamom On second thought, per your earlier comments, I'm going to drop the live checks on the snapshot.

Ok, good idea. You've got a lot of code on your plate. And I discovered with WM that it's not trivial. Plus, given the size of the DB, it would take a long time and a lot of resources. It still needs to be done, though. I had some ideas and will pass them by you when I'm ready to start the project (if you're still interested). I want to finish running WM2 first.

kaldari unsubscribed. Aug 26 2016, 9:49 PM

In T143214#2587508, @Green_Cardamom wrote:

Ok, good idea. You've got a lot of code on your plate. And I discovered with WM that it's not trivial. Plus, given the size of the DB, it would take a long time and a lot of resources. It still needs to be done, though. I had some ideas and will pass them by you when I'm ready to start the project (if you're still interested). I want to finish running WM2 first.

I'm going to make a simple interface for users to access the DB with. They will essentially login through OAuth. There will be a bot API, that you can then attach your bot to, where you simply forward your bot's OAuth tokens through an encrypted request, or send the header signature through to identify.

Forgive my ignorance, but I don't understand this bug. It looks like IABot is already using long-form archive URLs. For example, https://en.wikipedia.org/w/index.php?title=1715_in_Great_Britain&diff=prev&oldid=733890300.

MZMcBride subscribed. Aug 26 2016, 10:28 PM

In T143214#2587671, @kaldari wrote:

Forgive my ignorance, but I don't understand this bug. It looks like IABot is already using long-form archive URLs. For example, https://en.wikipedia.org/w/index.php?title=1715_in_Great_Britain&diff=prev&oldid=733890300.

The wayback machine only supports one form. We're talking about archive.is and webcite. They have a ridiculous amount of URL forms that can be used for the same snapshot.

kaldari renamed this task from "Have IABot use long form URLs." to "Have IABot use long form URLs for archive.is and webcite". Aug 27 2016, 1:34 AM

@Green_Cardamom so for the past few days, I've been trying to research if there is some hidden meaning to the 5 character ID for archive.is snapshots, hoping to get something, when decoded, that helps to reveal some snapshot information. I've come up with nothing so far, and the documentation and support at archive.is is abysmal at best.

Well good idea to think they might encode data in the ID. Is it still possible to resolve with web scrape? Given a list of URLs a script should be able to build a translation table pretty quickly assuming they don't rate block.

In T143214#2588590, @Green_Cardamom wrote:

Well good idea to think they might encode data in the ID. Is it still possible to resolve with web scrape? Given a list of URLs a script should be able to build a translation table pretty quickly assuming they don't rate block.

That's what IABot does to resolve the data for archive.is, but my goal is to make the function resolve the data without having to scrape the site when possible. It then uses that rationale to determine whether the URL needs to be converted. I've completed the WebCite function. It can now take all kinds of WebCite URLs and extract the data from them, but if it can't get both the snapshot time and the original URL from the archive URL, it will cURL the API and mark the URL for conversion.
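
The decision described above might look roughly like this. It is a sketch under assumptions, not IABot's function: it only handles the WebCite forms mentioned in this thread (an ID in the path, optionally with a `url` query parameter, or a `query` path with `url` and `date` parameters), and the function name, the return shape, and the placeholder ID in the usage note are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_webcite_url(archive_url):
    """Return (webcite_id, original_url, date, needs_lookup) for a WebCite URL.

    needs_lookup is True when the URL alone does not reveal both the
    snapshot time (via the ID or a date parameter) and the original URL,
    so the caller would have to ask the service and convert the URL.
    """
    parts = urlsplit(archive_url)
    query = parse_qs(parts.query)

    # The first path segment is either the snapshot ID or the literal "query".
    first_segment = parts.path.strip("/").split("/")[0]
    webcite_id = None if first_segment in ("", "query") else first_segment

    original_url = query.get("url", [None])[0]
    date = query.get("date", [None])[0]

    has_time = webcite_id is not None or date is not None
    needs_lookup = original_url is None or not has_time
    return webcite_id, original_url, date, needs_lookup
```

For example, `parse_webcite_url("http://www.webcitation.org/AbCdEfGh1")` (a made-up ID) reports `needs_lookup=True`, because the original URL cannot be recovered from the short form alone.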

Sounds good. Does it mean IABot will convert the existing short form cases it comes across on Wikipedia to long form, in addition to any in the DB currently in short form?

In T143214#2588604, @Green_Cardamom wrote:

Sounds good. Does it mean IABot will convert the existing short form cases it comes across on Wikipedia to long form, in addition to any in the DB currently in short form?

Yes. That's why this bug has been open for so long. I'm working on a thorough update that will not need a cleanup script. A new flag, "convert_archive_url", is being added. When set, it will forcibly overwrite the data in the DB and change the archive URL on Wikipedia.

It's taking a while to cover all of the cases, and get the research on the URLs done.

That's great.

Really the best way. Users will keep adding short form and so it will need constant monitoring.

Cyberpower678 closed this task as Resolved. Aug 30 2016, 2:34 AM
