
Ql filter attempts to process 'non-standard' titles for pages with Proofread-page content model...
Closed, Resolved · Public

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Open a PAWS 'bash' notebook
  • Enter the command line below (a rough Python equivalent is sketched after these steps):
pwb.py listpages -usercontribs:"ShakespeareFan00" -ql:4  -intersect -ns:104 -grep:"\<\!\-\-" -lang:en -family:wikisource   > comments_l4.txt
  • Run the command
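
For reference, here is a rough Python equivalent of the command above: a minimal sketch, assuming a working pywikibot setup for en.wikisource, that rebuilds the same generator chain listpages.py creates from these options.

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site('en', 'wikisource')
factory = pagegenerators.GeneratorFactory(site)
# Same filters as on the command line: the user's contributions, quality
# level 4, the Page: namespace (104) and a grep for HTML comments,
# intersected with each other.
for arg in ('-usercontribs:ShakespeareFan00', '-ql:4', '-intersect',
            '-ns:104', r'-grep:\<\!\-\-'):
    factory.handle_arg(arg)
gen = factory.getCombinedGenerator(preload=True)
for i, page in enumerate(gen, start=1):
    print(i, page.title())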

What happens?:

It will run for an extended period, eventually crashing out with an error:

Traceback (most recent call last):
  File "/srv/paws/pwb/pwb.py", line 487, in <module>
    main()
  File "/srv/paws/pwb/pwb.py", line 471, in main
    if not execute():
  File "/srv/paws/pwb/pwb.py", line 454, in execute
    run_python_file(filename, script_args, module)
  File "/srv/paws/pwb/pwb.py", line 143, in run_python_file
    exec(compile(source, filename, 'exec', dont_inherit=True),
  File "/srv/paws/pwb/scripts/listpages.py", line 282, in <module>
    main()
  File "/srv/paws/pwb/scripts/listpages.py", line 254, in main
    for i, page in enumerate(gen, start=1):
  File "/srv/paws/pwb/pywikibot/pagegenerators.py", line 1991, in <genexpr>
    return (page for page in generator
  File "/srv/paws/pwb/pywikibot/pagegenerators.py", line 2237, in PreloadingGenerator
    for page in generator:
  File "/srv/paws/pwb/pywikibot/pagegenerators.py", line 2009, in QualityFilterPageGenerator
    page = ProofreadPage(page)
  File "/srv/paws/pwb/pywikibot/proofreadpage.py", line 214, in __init__
    self._base, self._base_ext, self._num = self._parse_title()
  File "/srv/paws/pwb/pywikibot/proofreadpage.py", line 249, in _parse_title
    num = int(right)
ValueError: invalid literal for int() with base 10: 'marginals'
CRITICAL: Exiting due to uncaught exception <class 'ValueError'>
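
The parsing failure can also be reproduced in isolation. A minimal sketch (the title below is hypothetical, but follows the same pattern as the offending subpage: the part after the last '/' is not a page number):

import pywikibot
from pywikibot.proofreadpage import ProofreadPage

site = pywikibot.Site('en', 'wikisource')
# _parse_title() takes the part after the last '/' as the page number and
# calls int() on it; 'marginals' is not a number, so int() raises ValueError.
ProofreadPage(site, 'Page:Some work.djvu/12/marginals')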

What should have happened instead?:

There should have been a list of titles generated in the piped file, or a count of 0 pages found shown in the notebook output.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Firefox Nightly, current JupyterLab: version 3.3.4

Event Timeline

Xqt triaged this task as High priority. (May 10 2022, 2:10 PM)
Xqt added a subscriber: Mpaa.
Xqt raised the priority of this task from High to Needs Triage. (May 10 2022, 2:41 PM)

It's a subpage. It was created to hold the extensive sidetitles (which MediaWiki can't support natively) on the parent page, so that there was more flexibility in how they could be presented on different devices. Wikisource has templates like AuthorityReference to read in footnotes from an external page. However, that template can't be used (due to limitations in how Labelled Section Transclusion works) to do the same thing for content that would be on the "same" page, hence the marginals are on a subpage. I will note that the approach taken here was highly experimental in nature, and not finalised.

The page has the same 'model' as other proofread pages, in that it should contain a pagequality string to look for.
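
For context, a minimal sketch of how the quality level of a single Page: page can be inspected with pywikibot (assumptions: the ProofreadPage.quality_level attribute, which is what QualityFilterPageGenerator compares against the -ql value, and a hypothetical title):

import pywikibot
from pywikibot.proofreadpage import ProofreadPage

site = pywikibot.Site('en', 'wikisource')
page = ProofreadPage(site, 'Page:Some work.djvu/12')  # hypothetical title
# 0 = without text, 1 = not proofread, 2 = problematic,
# 3 = proofread, 4 = validated
print(page.quality_level)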

(A long-term solution to the issue of sidenotes/sidetitles would be an entirely different ticket (most likely an update to the Cite Extension) to permit their use without the 'too clever to be stable' workarounds applied here...)

This is not a standard way of working with Proofread pages, and handling it would create several inconveniences (e.g. how many pages will have the related index? etc.).
Before acting on pywikibot, there should be an agreement in the Wikisource world on such subpages.

Until then, the only reasonable thing to do might be to mitigate the issue by emitting a warning and discarding the page (if there are no major drawbacks in doing so).
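
A minimal sketch of such a mitigation (not the actual filter code; it assumes the quality_level attribute checked by QualityFilterPageGenerator and simply warns about and skips titles that cannot be parsed):

import pywikibot
from pywikibot.proofreadpage import ProofreadPage


def quality_filter(generator, quality):
    """Yield only pages whose proofread quality level is in *quality*."""
    for page in generator:
        try:
            pr_page = ProofreadPage(page)
        except ValueError as error:  # non-standard title, e.g. a '/marginals' subpage
            pywikibot.warning(f'Skipping {page}: {error}')
            continue
        if pr_page.quality_level in quality:
            yield page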

I am perfectly happy with the -ql filter discarding stuff it can't process with a warning...

@Mpaa - I've re-integrated the contents of the subpage into the root page, using the standard sidenotes approach, and once the subpage is removed, this ticket can be closed, as it's a very obscure edge case, unlikely to be encountered very often.

Setting this as low priority. It's a singular edge case on a specific local project.

However, it does make me wonder if there is a need for the filter to be more robust about how it processes titles that aren't in the standard format for the Proofread-page content model.

ShakespeareFan00 renamed this task from "Incompatible -ql and -grep filter options for listpages.py..." to "Ql filter attempts to process 'non-standard' titles for pages with Proofread-page content model...". (May 10 2022, 10:49 PM)

> Setting this as low priority. It's a singular edge case on a specific local project.

Which was apparently resolved by removing the page concerned.

> However, it does make me wonder if there is a need for the filter to be more robust about how it processes titles that aren't in the standard format for the Proofread-page content model.

I do not think so, as the filter is supposed to work only on pages that are in the standard format for the Proofread-page content model.

So this isn't actually a 'bug', but rather working as designed.
Is there a mechanism for checking if there are other 'non-standard' pages in the Page: namespace?

Change 791771 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] raise InvalidTitleError instead of ValueError in ProofreadPage

https://gerrit.wikimedia.org/r/791771

> So this isn't actually a 'bug', but rather working as designed.

Yes, but the exception should be more descriptive. I made a patch for it.
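
A minimal sketch of the idea behind the patch (not the actual diff; the helper name is hypothetical): when the part after the last '/' is not a number, raise a descriptive InvalidTitleError instead of letting int() raise a bare ValueError.

from pywikibot.exceptions import InvalidTitleError


def _parse_page_number(title: str, right: str) -> int:
    """Return the numeric part of a Page: title or raise InvalidTitleError."""
    try:
        return int(right)
    except ValueError:
        raise InvalidTitleError(
            f'{title!r} contains invalid page number {right!r}') from None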

> Is there a mechanism for checking if there are other 'non-standard' pages in the Page: namespace?

I don't know of one that is not greedy.
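
One greedy way to scan for other non-standard titles is to walk the whole Page: namespace and try to parse each title. A minimal sketch (assumptions: en.wikisource, namespace 104, and that both the old ValueError and the new InvalidTitleError should be caught); as noted above, this is slow on a large wiki:

import pywikibot
from pywikibot import pagegenerators
from pywikibot.exceptions import InvalidTitleError
from pywikibot.proofreadpage import ProofreadPage

site = pywikibot.Site('en', 'wikisource')
# Walk every title in the Page: namespace (104) and report the ones whose
# title cannot be parsed as "<base>.<ext>/<page number>".
for page in pagegenerators.AllpagesPageGenerator(site=site, namespace=104):
    try:
        ProofreadPage(page)
    except (ValueError, InvalidTitleError) as error:
        print(f'Non-standard title: {page.title()} ({error})')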

Xqt changed the subtype of this task from "Bug Report" to "Task".

Change 791771 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] raise InvalidTitleError instead of ValueError in ProofreadPage

https://gerrit.wikimedia.org/r/791771