Feature: Cosmetic Edit function
Closed, ResolvedPublic

Assigned To

Authored By

	Green_Cardamom
	Dec 2 2018, 6:51 PM

Description

Sometimes simple cleanup tasks are cosmetic, i.e. not strictly required but nice to have done at some point. Cosmetic edits are included only when other edits are made at the same time.

This is a request for a Cosmetic Edit task to remove the trailing "#" from the archiveurl and url fields, the latter only if it is also present in the archiveurl. This will reduce complexity in the source data, reduce the chance of future problems with other tools and services, and bring the URLs back to normal.
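A minimal Python sketch of the rule requested above, for illustration only. IABot itself is not written this way; the function name `strip_trailing_hash` and the placeholder archive URLs are hypothetical.

```python
from typing import Tuple


def strip_trailing_hash(url: str, archiveurl: str) -> Tuple[str, str]:
    """Apply the cosmetic rule to one citation's url/archiveurl pair.

    Assumption: "trailing #" means the field's last character is "#".
    """
    # If the archiveurl has no trailing "#", neither field is touched.
    if not archiveurl.endswith("#"):
        return url, archiveurl
    cleaned_archive = archiveurl[:-1]
    # The url field is cleaned only because the archiveurl also had the "#".
    cleaned_url = url[:-1] if url.endswith("#") else url
    return cleaned_url, cleaned_archive
```

For example, `strip_trailing_hash("http://example.com/#", "https://archive.example/2019/http://example.com/#")` drops the "#" from both fields, while a url ending in "#" is left alone when the archiveurl does not end in "#".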

Event Timeline

Green_Cardamom created this task. Dec 2 2018, 6:51 PM

Restricted Application assigned this task to Cyberpower678. Dec 2 2018, 6:51 PM

One example of an unintended side effect of the trailing "#" involves archive.today.

In this diff:

https://en.wikipedia.org/w/index.php?title=James_Dyson_Award&type=revision&diff=876968890&oldid=834212150

This URL was saved at archive.today:

http://au.ibtimes.com/articles/250141/20111116/aussie-wins-james-dyson-award-airdrop.htm#.UJv8z4aN1yI

To retrieve the archived copy, one must request:

https://archive.today/20130126051359/http://au.ibtimes.com/articles/250141/20111116/aussie-wins-james-dyson-award-airdrop.htm%23.UJv8z4aN1yI

This is because archive.today converts the "#" into a literal encoded character that is part of the URL, rather than treating it as the reserved character that introduces a fragment.

Thus all the URLs with trailing "#" that are being saved at archive.today can only be discovered there by adding a "%23" to the end of the URL.
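The lookup described above can be sketched in Python as follows. This only mirrors the behaviour observed in the example; the function name `archive_today_lookup_url` is hypothetical, and encoding only the "#" character is an assumption based on that single example.

```python
def archive_today_lookup_url(timestamp: str, original_url: str) -> str:
    """Build the archive.today address under which a URL containing "#" is stored.

    Assumption: only "#" needs percent-encoding (to "%23"), as in the
    example from this thread; other characters are passed through as-is.
    """
    encoded = original_url.replace("#", "%23")
    return f"https://archive.today/{timestamp}/{encoded}"


saved = "http://au.ibtimes.com/articles/250141/20111116/aussie-wins-james-dyson-award-airdrop.htm#.UJv8z4aN1yI"
lookup = archive_today_lookup_url("20130126051359", saved)
```

With the URL and timestamp from the example, `lookup` ends in `airdrop.htm%23.UJv8z4aN1yI`, matching the address given above.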

Cosmetic function:

Convert archive.is, .fo, .li, .vn, .md, .ph

--> archive.today

The conversion is not required for the links to work today, but it could be one day, and the site prefers that we use .today.

The idea is to chip away at the conversion now to mitigate a possible large or emergency conversion later.
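As a rough illustration of the domain conversion, here is a Python sketch that rewrites only the host of an archive URL and leaves every other character untouched. The function name `to_archive_today` is hypothetical, and the alias list is taken directly from the comment above.

```python
import re

# Hosts listed in this task: archive.is, .fo, .li, .vn, .md, .ph
_ALIAS_HOST = re.compile(
    r"^(https?://)archive\.(?:is|fo|li|vn|md|ph)(?=/|$)",
    re.IGNORECASE,
)


def to_archive_today(archive_url: str) -> str:
    """Rewrite an alias host to archive.today; return other URLs unchanged."""
    return _ALIAS_HOST.sub(r"\1archive.today", archive_url)
```

A regular expression is used here instead of parsing and reassembling the URL so that the rest of the string, including the embedded original URL, is preserved byte for byte.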

Cyberpower678 moved this task from Inbox to v2.0 on the InternetArchiveBot board. Mar 6 2019, 10:50 PM

Cyberpower678 edited projects, added InternetArchiveBot (v2.0); removed InternetArchiveBot.

I'm a little confused. The blank # should automatically be cleaned out if the URL gets processed again. This is because the bot detaches the fragment internally and, if it is empty, does not re-attach it. As for converting the archive.is domains, I believe this is forced on enwiki.
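The detach-and-reattach behaviour described in this comment can be illustrated with a short Python sketch. This is not IABot's actual code, only an analogue using the standard library's `urllib.parse.urldefrag`; the function name `reprocess_url` is hypothetical.

```python
from urllib.parse import urldefrag


def reprocess_url(url: str) -> str:
    """Detach the fragment, then re-attach it only if it is non-empty."""
    base, fragment = urldefrag(url)
    # An empty fragment (a blank trailing "#") is simply dropped.
    return f"{base}#{fragment}" if fragment else base
```

Under this sketch, `reprocess_url("http://example.com/page#")` returns the URL without the "#", while a URL with a real fragment such as `#sec` comes back unchanged.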

This is for preexisting links in wikisource. The edit occurs in wikisource. Wikisource is loaded with archive.is links that should be converted to archive.today; likewise, many articles contain URLs that end in # due to the old bug.

Wikisource? Does IABot even edit there? I thought it edits Wiktionary.

For example, if an article contains the following:

<ref>{{cite web |url=http://example.com |archiveurl=https://archive.is/201902020101/http://example.com |archivedate=2019-02-02 |deadurl=yes}}</ref>

IABot arrives and sees it, but normally does nothing because there is nothing to do. The citation is correct, the archive URL and date are correct, etc. However, with this cosmetic function, it would make an edit so that the article now contains:

<ref>{{cite web |url=http://example.com |archiveurl=https://archive.today/201902020101/http://example.com}}</ref>

It would modify the article (the wiki source code, the wiki markup, whatever it is called). That is, it would edit the article such that https://archive.is is now https://archive.today.
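A minimal Python sketch of such an edit applied to wiki markup, assuming the archive link sits in an `archiveurl` template parameter as in the example. The function name `normalize_archiveurl_params` is hypothetical, and this is not how IABot is implemented; a real bot would also need to handle other parameter spellings and bare links.

```python
import re

# Matches "|archiveurl=https://archive.is" (or .fo, .li, .vn, .md, .ph)
# and captures everything up to the host so it can be kept verbatim.
_ARCHIVEURL_PARAM = re.compile(
    r"(\|\s*archiveurl\s*=\s*https?://)archive\.(?:is|fo|li|vn|md|ph)(?=/)",
    re.IGNORECASE,
)


def normalize_archiveurl_params(wikitext: str) -> str:
    """Point archiveurl parameters at archive.today; leave all else as-is."""
    return _ARCHIVEURL_PARAM.sub(r"\1archive.today", wikitext)


before = (
    "<ref>{{cite web |url=http://example.com "
    "|archiveurl=https://archive.is/201902020101/http://example.com "
    "|archivedate=2019-02-02 |deadurl=yes}}</ref>"
)
after = normalize_archiveurl_params(before)
```

Only the host inside the `archiveurl` value changes; the `url`, `archivedate` and `deadurl` parameters pass through untouched.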

IABot should already do that. It's a part of the normalization function which is enabled on enwiki.

No, it does not normalize archive.is URLs to archive.today, as in the above example.

It will only normalize when converting from short-form to long-form.

Cirdan moved this task from Unsorted to New feature on the InternetArchiveBot (v2.0) board. May 12 2019, 10:02 AM

Harej removed Cyberpower678 as the assignee of this task. Apr 22 2020, 1:14 AM

Harej added a subscriber: Cyberpower678.

This should have been resolved long ago.
