Phabricator

Create an API server on Toolforge that can replace reflinks.py's 404-links.txt file
Open, Lowest, Public

Description

In change https://gerrit.wikimedia.org/r/#/c/393100/, @Zoranzoki21 pointed out the problem that a large archive must be downloaded in order to use reflinks.py.
It is therefore proposed to create a web server with which reflinks.py could interact via an API.
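A minimal sketch of what the client side of such a proposal could look like. The Toolforge endpoint URL and the JSON reply shape below are assumptions for illustration, not an existing service:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical endpoint; no such service exists yet.
DEAD_LINKS_API = "https://deadlinks.toolforge.org/api/check"

def build_query(url, api=DEAD_LINKS_API):
    """Build the GET request URL for checking one link."""
    return api + "?url=" + quote(url, safe="")

def parse_reply(body):
    """Parse the assumed JSON reply {"dead": true/false}.

    Anything malformed counts as "not dead", so the bot degrades
    gracefully instead of mislabelling links when the API misbehaves.
    """
    try:
        return bool(json.loads(body).get("dead", False))
    except (ValueError, AttributeError):
        return False

def is_known_dead(url, timeout=5):
    """Query the (hypothetical) API; network errors count as alive."""
    try:
        with urlopen(build_query(url), timeout=timeout) as resp:
            return parse_reply(resp.read().decode("utf-8"))
    except OSError:
        return False
```

With this in place, reflinks.py would call `is_known_dead(url)` instead of looking the URL up in a locally downloaded 404-links.txt.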

Event Timeline

Framawiki created this task. Dec 4 2017, 6:58 PM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Dec 4 2017, 6:58 PM
Zoranzoki21 added a comment (edited). Dec 4 2017, 7:00 PM

In each installation, for a user to be able to use the reflinks script, this file has to be present in the main folder. I support the idea of creating a web server with which reflinks.py will interact via an API. Then users will no longer need to download this file every time.

Also, the file has not been updated for many years; its last update was in 2007.

Isn't the dead link scanning now done by IABot?

What is reflinks?

Zoranzoki21 added a comment (edited). Dec 4 2017, 8:51 PM

> Isn't the dead link scanning now done by IABot?

But why can reflinks not work without this file, then?

Change 395094 had a related patch set uploaded (by Zoranzoki21; owner: jenkins-bot):
[pywikibot/core@master] Disable needing text file for running reflinks.py script

https://gerrit.wikimedia.org/r/395094

Change 395094 abandoned by Zoranzoki21:
Disable needing text file for running reflinks.py script

Reason:
What happened?

https://gerrit.wikimedia.org/r/395094

@Zoranzoki21, @zhuyifei1999, @Cyberpower678: reflinks.py is a Python script which goes through bare links on wiki pages and finds out more details (page title, MIME type). The script uses a pre-generated text file containing 404 links (gathered from wiki pages) to avoid marking temporarily inaccessible links as dead links. The text file can be downloaded from the script author's webpage, but it is old and unmaintained there and contains only enwiki articles. It can also be newly created yourself using the Python script weblinkchecker.py, but it takes a week to create this list (in order to eliminate temporarily inaccessible links). Currently we are looking for a better solution.
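To make the mechanism concrete, here is a rough sketch of the role the 404-links file plays. The helper names and the assumed one-URL-per-line file format are made up for illustration; this is not the actual reflinks.py code:

```python
def load_dead_links(lines):
    """Parse a 404-links style file: one URL per line,
    ignoring blank lines and '#' comments (assumed format)."""
    dead = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            dead.add(line)
    return dead

def should_mark_dead(url, dead_links):
    """Only mark a link dead if it appears in the pre-generated
    list, so temporarily unreachable links are not flagged."""
    return url in dead_links
```

An API server would replace `load_dead_links` (reading a local file) with a remote lookup, while the decision logic in `should_mark_dead` stays the same.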

Change 395095 had a related patch set uploaded (by Zoranzoki21; owner: Zoranzoki21):
[pywikibot/core@master] reflinks.py: Disable needing 404-links.txt for running script

https://gerrit.wikimedia.org/r/395095

Zoranzoki21 added a comment (edited). Dec 4 2017, 9:20 PM

> reflinks.py is a Python script which goes through bare links on wiki pages and finds out more details (page title, MIME type). The script uses a pre-generated text file containing 404 links (gathered from wiki pages) to avoid marking temporarily inaccessible links as dead links. [...] Currently we are looking for a better solution.

I am thinking of one solution: I could run weblinkchecker from Toolforge, and then we would get a clean, updated file.

I will create a file with dead links from srwiki. I have started: http://prntscr.com/hj4pjq

You know I have a container web script that can run checks on an array of links. I can set it up.
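For illustration, a batch check over an array of links might work roughly like this. The function names and payload shape are assumptions, not the actual container script mentioned above:

```python
def classify_status(code):
    """Map an HTTP status code to a link verdict: only permanent
    failures (404, 410) count as 'dead', so temporary errors
    (5xx, rate limits) are not misreported."""
    return "dead" if code in (404, 410) else "alive"

def check_batch(results):
    """results: {url: status_code} as gathered by the checker.
    Returns {url: 'dead' | 'alive'}."""
    return {url: classify_status(code) for url, code in results.items()}
```

Treating non-permanent errors as alive mirrors the purpose of 404-links.txt: avoiding false positives on temporarily unreachable links.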

> You know I have a container web script that can run checks on an array of links. I can set it up.

I know.

> You know I have a container web script that can run checks on an array of links. I can set it up.

Tomorrow I will start an RfC on srwiki to enable IABot.

Change 395095 abandoned by Zoranzoki21:
reflinks.py: Disable needing 404-links.txt for running script

https://gerrit.wikimedia.org/r/395095

@Cyberpower678 I have started an RfC on srwiki for enabling InternetArchiveBot there.

If all is OK after 7 days, I will create a task for it.

Xqt triaged this task as Lowest priority. Dec 6 2017, 8:16 AM