weblinkchecker: Use https instead of http for web.archive.org
Closed, ResolvedPublic

Description

See parent task and @Zoranzoki21's comment on https://gerrit.wikimedia.org/r/#/c/357849/

Event Timeline

I modified the code on Saturday and tested it, but it does not work. The only parts that work are the Serbian translation I added and my change to add one blank line per report instead of three.

See my comment on Gerrit.

As far as I can see, the problem comes from the _get_closest_memento_url function, which uses memento_client.MementoClient() from an external library: https://github.com/mementoweb/py-memento-client

>>> import datetime
>>> import memento_client
>>> mc = memento_client.MementoClient()
>>> when = datetime.datetime.now()
>>> url = 'http://www.fallingrain.com/world/YI/2/Dunisice.html'
>>> memento_info = mc.get_memento_info(url, when)
>>> memento_info
{'mementos': {'last': {'uri': ['http://web.archive.org/web/20110228040245/http://www.fallingrain.com:80/world/YI/2/Dunisice.html'], 'datetime': datetime.datetime(2011, 2, 28, 4, 2, 45)}, 'closest': {'datetime': datetime.datetime(2011, 2, 28, 4, 2, 45), 'uri': [u'http://web.archive.org/web/20110228040245/http://www.fallingrain.com:80/world/YI/2/Dunisice.html'], 'http_status_code': 404}, 'first': {'uri': ['http://web.archive.org/web/20071001061940/http://www.fallingrain.com/world/YI/2/Dunisice.html'], 'datetime': datetime.datetime(2007, 10, 1, 6, 19, 40)}}, 'original_uri': 'http://www.fallingrain.com/world/YI/2/Dunisice.html', 'timegate_uri': 'http://timetravel.mementoweb.org/timegate/http://www.fallingrain.com/world/YI/2/Dunisice.html'}
>>> mementos = memento_info.get('mementos')
>>> mementos
{'last': {'uri': ['http://web.archive.org/web/20110228040245/http://www.fallingrain.com:80/world/YI/2/Dunisice.html'], 'datetime': datetime.datetime(2011, 2, 28, 4, 2, 45)}, 'closest': {'datetime': datetime.datetime(2011, 2, 28, 4, 2, 45), 'uri': [u'http://web.archive.org/web/20110228040245/http://www.fallingrain.com:80/world/YI/2/Dunisice.html'], 'http_status_code': 404}, 'first': {'uri': ['http://web.archive.org/web/20071001061940/http://www.fallingrain.com/world/YI/2/Dunisice.html'], 'datetime': datetime.datetime(2007, 10, 1, 6, 19, 40)}}
>>> mementos['closest']['uri'][0]
u'http://web.archive.org/web/20110228040245/http://www.fallingrain.com:80/world/YI/2/Dunisice.html'
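
For reference, the lookup in the session above can be condensed into a small helper. This is only a sketch: `closest_memento_uri` is a hypothetical name (it is not part of pywikibot or py-memento-client), and `sample` is a trimmed copy of the response shown above, so no network call is made.

```python
def closest_memento_uri(memento_info):
    """Return the first URI of the 'closest' memento, or None if absent."""
    mementos = (memento_info or {}).get('mementos') or {}
    closest = mementos.get('closest') or {}
    uris = closest.get('uri') or []
    return uris[0] if uris else None


# Trimmed copy of the dictionary returned in the session above.
sample = {
    'original_uri': 'http://www.fallingrain.com/world/YI/2/Dunisice.html',
    'mementos': {
        'closest': {
            'uri': ['http://web.archive.org/web/20110228040245/'
                    'http://www.fallingrain.com:80/world/YI/2/Dunisice.html'],
            'http_status_code': 404,
        },
    },
}

uri = closest_memento_uri(sample)
# The library hands back a plain-http archive link, which is the problem.
print(uri.startswith('https://'))  # prints False
```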

Confirmed.
I'll create a task on the memento client's GitHub repo and upload a temporary hacky patch for pywikibot.

Change 358053 had a related patch set uploaded (by Framawiki; owner: Framawiki):
[pywikibot/core@master] [bugfix] weblinkchecker.py: Use https for web.archive.org

https://gerrit.wikimedia.org/r/358053

Change 358053 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] weblinkchecker.py: Use https for web.archive.org

https://gerrit.wikimedia.org/r/358053

If that is resolved, can I test now to see the effect?

Framawiki changed the task status from Open to Stalled. Jun 10 2017, 5:31 PM

@Zoranzoki21 To get the latest version of pywikibot, you have to use Git: https://www.mediawiki.org/wiki/Manual:Pywikibot/Gerrit. Don't hesitate to tell me if I can help you use it.

NOTE: This task is not solved, we currently use a local hack.

I mean the part selected in green.

What are we supposed to see there? There doesn't seem to be a web.archive.org link there?

Kizule changed the task status from Stalled to Open. Jun 13 2017, 6:49 PM

Hmm. I have now downloaded the weblinkchecker script from http://tools.wmflabs.org/pywikibot/ and it still does not use https. See here.

Aha. It seems getInternetArchiveURL uses https://archive.org/wayback/available?url=http://nl.wikipedia.org/, which also still returns an http link. So the search-and-replace should be moved to setLinkDead.
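
To illustrate the idea of doing the replacement in one place, here is a minimal sketch of a helper that could be applied to whatever archive URL reaches setLinkDead, regardless of which lookup produced it. The name `force_https_for_archive` and the `ARCHIVE_HOSTS` set are assumptions made for this example, not existing pywikibot code.

```python
from urllib.parse import urlsplit

# Assumption: web.archive.org is the only host that should be upgraded.
ARCHIVE_HOSTS = {'web.archive.org'}


def force_https_for_archive(url):
    """Return url with http replaced by https if it points at a known archive host."""
    url = url.strip()
    parts = urlsplit(url)
    if parts.scheme == 'http' and parts.hostname in ARCHIVE_HOSTS:
        # Swap only the leading scheme so the embedded original URL is untouched.
        return 'https' + url[len('http'):]
    return url


print(force_https_for_archive(
    'http://web.archive.org/web/20110228040245/'
    'http://www.fallingrain.com:80/world/YI/2/Dunisice.html'))
```

Checking the host instead of doing a blind string replace leaves links from other archive hosts unchanged, and keeps the `http://` inside the archived path as it is.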

OK. Can it be changed to https?

The tool also uses service providers other than archive.org.

Ok

Oh, ok, I'll look at it too.

Good news: the issue is fixed in the library, so I'll check whether we can revert my hacky patch.

I do not get https. With my bot, I replaced http with https for webarchive links in all articles on Serbian Wikipedia. I then started the script, but I still do not get https. See: https://sr.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B7%D0%B3%D0%BE%D0%B2%D0%BE%D1%80:D%27Ilio,_Chieti

I have updated now, but I still do not get https.

Change 380923 had a related patch set uploaded (by Zoranzoki21; owner: Zoranzoki21):
[pywikibot/core@master] [bugfix] weblinkchecker.py: Use https for web.archive.org

https://gerrit.wikimedia.org/r/380923

Change 380923 abandoned by Zoranzoki21:
[bugfix] weblinkchecker.py: Use https for web.archive.org

Reason:
All checks on GitHub failed, and I do not know how to make a permanent fix for https.

https://gerrit.wikimedia.org/r/380923

As the issue with memento is solved, is this issue solved too? Is the hacky patch reverted? Can this be marked as resolved?

This is not fixed.

Change 806554 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] Revert: [bugfix] weblinkchecker.py: Use https for web.archive.org

https://gerrit.wikimedia.org/r/806554

Xqt claimed this task.

Change 806554 merged by jenkins-bot:

[pywikibot/core@master] Revert: [bugfix] weblinkchecker.py: Use https for web.archive.org

https://gerrit.wikimedia.org/r/806554