
Add an automated way for the ProofreadPage extension to populate the imagelinks table
Closed, Resolved · Public

Description

ProofreadPage uses images in some special ways that do not always generate a record in the imagelinks table. This behavior is well known and has already been addressed in some places, for example in Special:GlobalUsage on Wikimedia Commons [1].

The absence of a record in the imagelinks table not only breaks some statistics and monitoring tasks, but also creates challenges for those interested in forking, mirroring or pre-parsing XML dumps for offline usage who rely on the imagelinks.sql.gz dumps of Wikisource wikis.

Please add an automated way to create records in the imagelinks table for images used by the ProofreadPage extension.
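
One way to check whether a given scan currently has any recorded usage is the MediaWiki API's list=imageusage query, which reports the pages recorded as using a file. The following is a minimal sketch, not part of the original report: the helper names, the example.org endpoint and the file title are hypothetical, and the sample response in the usage note is illustrative only.

```python
from urllib.parse import urlencode


def build_imageusage_url(api_url: str, file_title: str, limit: int = 50) -> str:
    """Build a GET URL asking which pages are recorded as using file_title."""
    params = {
        "action": "query",
        "list": "imageusage",   # standard MediaWiki query module
        "iutitle": file_title,  # the file whose usage is being checked
        "iulimit": limit,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def pages_using_file(response: dict) -> list:
    """Extract page titles from a decoded imageusage JSON response."""
    entries = response.get("query", {}).get("imageusage", [])
    return [entry["title"] for entry in entries]


if __name__ == "__main__":
    # Hypothetical endpoint and file title, for illustration only.
    print(build_imageusage_url("https://example.org/w/api.php", "File:Example.djvu"))
```

An empty list from pages_using_file for a file that is known to back an Index: or Page: page would be the symptom described in this task.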


I've made two workarounds for pt.wikisource, but those workarounds are very hard to generate and maintain. For the record, [2] tracks individual files that contain a single digitized page and [3] covers the multipage file formats.

pt.wikisource and the vast majority of Wikisources already have a workaround for multipage files (rendering the cover or title page directly on the Index: page), but this method isn't guaranteed to always work, since some Index pages don't or won't render it for some unknown reason (maybe new users who don't know how to do it, or even the absence of a scanned page that properly illustrates the digitization set?).


[1] - https://commons.wikimedia.org/wiki/MediaWiki:Globalusage-header
[2] - https://pt.wikisource.org/wiki/Utilizador:555/force_pagelinks_table_record/001
[3] - https://pt.wikisource.org/wiki/Utilizador:555/force_pagelinks_table_record/ext001

Event Timeline

555 created this task. Apr 30 2015, 12:14 AM
555 updated the task description.
555 raised the priority of this task to Needs Triage.
555 added projects: ProofreadPage, Wikisource.
555 added a subscriber: 555.
Restricted Application added a subscriber: Aklapper. Apr 30 2015, 12:14 AM
GOIII added a subscriber: GOIII. Apr 30 2015, 2:36 AM
Aklapper triaged this task as Low priority. Apr 30 2015, 11:31 AM
Tpt moved this task from Backlog to Top priority on the ProofreadPage board. Jul 4 2015, 7:05 PM
Restricted Application added a subscriber: Steinsplitter. Jul 4 2015, 7:05 PM

Change 222858 had a related patch set uploaded (by Tpt):
Adds Page: pages scan image to imagelinks

https://gerrit.wikimedia.org/r/222858

Change 222858 merged by jenkins-bot:
Adds Page: pages scan image to imagelinks

https://gerrit.wikimedia.org/r/222858

GOIII added a subscriber: Tpt. Jul 11 2015, 8:24 AM

@555, @Tpt -- Fixed?

Tpt added a comment. Jul 11 2015, 8:41 PM

@GOIII: no, the merged change only fixes the bug for Page: pages. This issue is not solved yet for Index: pages.

GOIII updated the task description. Jul 12 2015, 12:35 AM
GOIII set Security to None.

Hi all, after completing T108799 we should now have the correct file usage for all pages of the Wikisources. @Tpt, what is the current status for the Index pages? Is there anything I could do?

Index pages should all be good. I purged them all recently for another bug,
and @Mpaa wrote some code that pushes their updates better during editing.

GOIII updated the task description. Dec 2 2015, 3:34 AM
GOIII added a comment. Edited Dec 2 2015, 4:29 AM

All this is driving me to drink... especially now with all reported "weirdness" in Page: editing and WikiEditor drop outs - I think all that 'touching' was akin to shuffling the deck chairs on the Titanic (e.g. some files got fixed, some files got broke).

Going by the documentation for API:Purge, I'm under the impression that a simple purge (just &action=purge) guarantees nothing more than purging the cache of the target file. We all know that most of the time this is enough to "dislodge" most thumbnail and external linkage issues as a direct result (if not just a happy byproduct of executing a simple purge in and of itself).

Still going by my understanding of the documentation for API:Purge's Parameters there are two seemingly relevant additional actions that can be initiated along with a simple purge. They are defined as....

  • forcelinkupdate: If set, updates the link tables
  • forcerecursivelinkupdate: Like forcelinkupdate, but also do forcelinkupdate on any page that transcludes the current page. This is akin to making an edit to a template. Note that the job queue is used for this operation, so there may be a slight delay when doing this for pages used a large number of times.

I consider the use of one or both of these along with &action=purge a complex or hard purge.
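
The difference between the simple purge and the "hard" purge described above can be sketched as request parameters. This is a minimal illustration under stated assumptions: the helper names and page titles are hypothetical, only the action, titles, forcelinkupdate and forcerecursivelinkupdate parameters come from the API:Purge documentation quoted above, and the request would be sent as a POST body, which is what current MediaWiki versions appear to expect for this module.

```python
from urllib.parse import urlencode


def build_purge_params(titles, link_update=False, recursive=False):
    """Return API parameters for a simple or a 'hard' purge of the given titles."""
    params = {
        "action": "purge",
        "titles": "|".join(titles),  # multiple titles are pipe-separated
        "format": "json",
    }
    if recursive:
        # Documented as "like forcelinkupdate", plus the pages that
        # transclude the target, so forcelinkupdate is not added as well.
        params["forcerecursivelinkupdate"] = 1
    elif link_update:
        params["forcelinkupdate"] = 1
    return params


def encode_post_body(params):
    """Encode the parameters as a form-encoded POST body."""
    return urlencode(params).encode("ascii")


if __name__ == "__main__":
    # Hypothetical title, for illustration only.
    simple = build_purge_params(["Index:Example.djvu"])
    hard = build_purge_params(["Index:Example.djvu"], recursive=True)
    print(encode_post_body(simple))
    print(encode_post_body(hard))
```

Whether a hard purge issued on one wiki also refreshes usage data held on another wiki is exactly the open question raised in the following paragraphs; the sketch makes no claim about that.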

Now what I can't understand is why we (at least on Wikisource) are not using (or touching?) purges with the additional parameters that SPECIFICALLY state they also "update" link tables and/or transcluded pages? Why not intentionally cover all "three" factors at once instead of batting one thing around at a time like a cat trying to hide its turd on a marble floor?

Is it that executing this on the Commons hosted File: page has no effect on the Wikisource associated Index: and Page:s derived from that source file?

Is it that executing this on the Wikisource hosted Index: page only affects the derived Page: namespace pages but not the link-list on Commons?

Call me crazy but that same API documentation gives the solution to this issue by example as...

Purge file with inconsistent external links table on commons:

Wtf am I missing here?

555 added a comment. Dec 2 2015, 3:58 PM

Wtf am I missing here?

That touch.py is available only on the discontinued pywikibot branch and allows only null edits (neither the standard purge is available) and the module was run by a user without a bot flag on all Wikisource subdomains, causing issues related to the noratelimit right.

Weeks ago I ran exactly the same module under my bot-flagged account on pt.Wikisource, fixing all remaining pages for that subdomain.

GOIII added a comment. Dec 2 2015, 11:43 PM
In T97613#1844897, @555 wrote:

Wtf am I missing here?

That touch.py is available only on the discontinued pywikibot branch and allows only null edits (neither the standard purge is available) and the module was run by a user without a bot flag on all Wikisource subdomains, causing issues related to the noratelimit right.

Weeks ago I ran exactly the same module under my bot-flagged account on pt.Wikisource, fixing all remaining pages for that subdomain.

That's fine I guess... but what about the ~43,653 other Wikisourcerers who have no clue about modules, pywiki, python, never mind a simple purge? Are they forced to come hunt one of you good coder folks down every time they start seeing inaccurate link lists or mismatched thumbnails? Just make a request and somebody will get to it at some point? Sorry for the sarcasm but the point I'm trying to get across is that [in my view] this was not a solution even though things got fixed in the process.

Either way some clarification on who and what are affected by the various API based commands would be helpful in some way or degree, no?

In T97613#1844897, @555 wrote:

Wtf am I missing here?

That touch.py is available only on the discontinued pywikibot branch and allows only null edits (neither the standard purge is available) and the module was run by a user without a bot flag on all Wikisource subdomains, causing issues related to the noratelimit right.

Weeks ago I ran exactly the same module under my bot-flagged account on pt.Wikisource, fixing all remaining pages for that subdomain.

Touch.py is mainstream. The touch component was updated to work better with bots and this was to overcome where the touch became more than a touch and became a character edit (due to other mediawiki factors). The purge component works beautifully, and has for a while.

GOIII added a comment. Edited Dec 3 2015, 1:58 PM

Touch.py is mainstream. The touch component was updated to work better with bots and this was to overcome where the touch became more than a touch and became a character edit (due to other mediawiki factors). The purge component works beautifully, and has for a while.

Goody goody for you I guess??

Now please read 555's statement one more time and maybe the point about the lack of the OTHER purge parameters means all you are doing is refreshing the file cache and not forcing the rebuilding of the image link tables (nor refreshing anything being transcluded from said file).

.... only null edits (neither the standard [nor complex] purge is available) ...

I get the impression those actions are initiated by happenstance thanks to the simple purge from your 'touch' -- meaning you've fixed some of the previous incomplete link table and thumbnail mismatch issues but my bet is not all of them. Even worse, file "states" that were fine before the touch may now be suffering from the same issue(s) unwittingly.

Why in blazes do you think ShakespeareFan still keeps posting every other day about page thumbnails still not matching the expected page progression/page content until someone else comes along and manually purges the Index:/File: - re-starting the whole purge process again?

So it would appear one needs to force a recursive link update along with purging the file cache to ensure no such disparity is inadvertently created by just a simple purge/null edit. And all I'm saying is there is no reason this ever needs to reach sysop or BOT level interaction when [per the API documentation] there must be a way to execute the complex purge just as easily as it is for users to execute the simple purge now.

@GOIII byte me. I simply commented on touch.py, nothing else. Being argumentative is not helpful

My touching files was to get the image links onto Commons for the pages as per the ticket, with a tool said to be fit for the task. If it edited rather than touched, then it is no different from any other edit; to my understanding an edit is an edit. I cannot respond to any assertion that you make, and I am not currently in a position to run a series of tests.

Re SF, I don't know and I really don't care. I try to separate myself from them completely.

Change 354519 had a related patch set uploaded (by Tpt; owner: Tpt):
[mediawiki/extensions/ProofreadPage@master] Adds used file to <pagelist> dependencies

https://gerrit.wikimedia.org/r/354519

Change 354519 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Adds used file to <pagelist> dependencies

https://gerrit.wikimedia.org/r/354519

Tpt added a comment. May 26 2017, 8:53 AM

I believe this issue is now solved (with a bit of cache purging probably required). @Billinghurst could you confirm?

Billinghurst added a comment. Edited May 26 2017, 2:14 PM

@Tpt I am not sure what more is expected to display on a File: page for usage. On new additions, and older works, I am seeing full linking of pages that are utilised.

Recent addition

Old addition

If there is something not using global usage properly, then @555 or someone else will need to point to examples. We can run wikisource-bot through a number of Wikisources today; if we are talking about Commons, we will need to get a bot right there, for the high editing rate allocation that any larger or quicker runs need.

Tpt closed this task as Resolved. Nov 15 2017, 10:20 AM
Tpt claimed this task.

It seems to me that this problem is now solved.