
XMLDumpPageGenerator is still not working
Closed, Resolved · Public · Bug Report

Description

I just installed the latest master release of pywikibot (7.2.0.dev0) on a brand new Windows 10 with the latest Python (3.10.4).
Executing a simple search and replace like this one over a dump...

python pwb.py replace.py -xml:itwiki-20220401.xml -ns:0 -lang:it "meteorite" ""

...it just feeds every page in the dump, even those that do not contain that word.

Retrieving 50 pages from wikipedia:it.
No changes were necessary in [[Organo a pompa]]
No changes were necessary in [[Antropologia]]
No changes were necessary in [[Agricoltura]]
No changes were necessary in [[Architettura]]
No changes were necessary in [[Astronomia]]
No changes were necessary in [[Archeologia subacquea]]
No changes were necessary in [[Analisi delle frequenze]]
No changes were necessary in [[Aerofoni]]
No changes were necessary in [[Arte]]
[...]

Event Timeline

@Basilicofresco: are you sure that any of the skipped pages actually has "meteorite" in its content?

None of the skipped pages has that word. The point of running replace.py on a dump should be to load only the pages containing that word, not every page.

The xml dump was filtered by the replacements first, and the same filtering and replacements were applied again later when processing the pages. This could lead to pages not being processed when the page content in the xml file was outdated and filtered out. The new behaviour takes this into account and goes back to a default implementation from pagegenerators. Maybe this can be improved by using text_predicate, or some additional filtering can be implemented with the GeneratorFactory.
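The text_predicate idea above could look roughly like the following: a streaming dump reader that yields only entries whose text satisfies a predicate. This is a minimal stdlib-only sketch (SAMPLE_DUMP, filtered_pages and the simplified XML layout are illustrative, not pywikibot's actual code or dump schema):

```python
import re
import xml.etree.ElementTree as ET
from io import StringIO

# A minimal stand-in for a MediaWiki dump; real dumps declare an XML
# namespace and carry many more fields, omitted here for brevity.
SAMPLE_DUMP = """<mediawiki>
<page><title>Organo a pompa</title><revision><text>reed instrument</text></revision></page>
<page><title>Meteoritica</title><revision><text>Un meteorite cade.</text></revision></page>
</mediawiki>"""

def filtered_pages(source, text_predicate):
    """Stream the dump and yield (title, text) only for entries whose
    text satisfies text_predicate -- the pre-filtering behaviour the
    old XmlDumpReplacePageGenerator provided."""
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'page':
            title = elem.findtext('title')
            text = elem.findtext('revision/text') or ''
            if text_predicate(text):
                yield title, text
            elem.clear()  # free memory: real dumps are tens of GB

pattern = re.compile('meteorite')
hits = list(filtered_pages(StringIO(SAMPLE_DUMP),
                           lambda t: bool(pattern.search(t))))
# hits -> [('Meteoritica', 'Un meteorite cade.')]
```

Because non-matching entries never leave the generator, downstream code only ever sees the pages worth editing.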

These articles never contained the word "meteorite".
Moreover "Organo a pompa" is the very first article written in the current itwiki-20220401-pages-articles.xml dump, "Antropologia" is the second one, "Agricoltura" the third one, etc.
The problem is that it is not filtering the xml by the replacements at all; it is just listing every single page present in the dump, one by one.

Why is this a problem?

Change 780808 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] Remove message when no change was made

https://gerrit.wikimedia.org/r/780808

Well, probably I did not express myself well.
The whole point of using the dump with replace.py is to rapidly filter the xml by the replacements, in order to speed up the process of replacing something with something else across the whole ns:0. replace.py has worked this way for at least 15 years.
At the moment it just lists every page within the dump, even if those pages do not contain the word "meteorite", and that is a problem because it makes the use of a big dump pointless. I could do the same without the dump by executing something like

python pwb.py replace.py -start:* -ns:0 -lang:it "meteorite" ""

Moreover, the message "No changes were necessary in [[page]]" should not be removed, because it is useful. Thanks to it I know that a page contained, at the time of the dump, something that triggered the replacements but no longer triggers them. I have used this information for a wide range of purposes: understanding whether someone else was fixing the same problem, testing whether an improved version of a regex still catches some pages, quickly listing pages I had already fixed for a specific regex, and many more.

Change 780854 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] Speed up XMLDumpPageGenerator

https://gerrit.wikimedia.org/r/780854

Ok, I'll make some measurements with the old and the new implementation shortly. Anyway, I found out that the XMLDumpPageGenerator implementation is very slow (in both the old and the new implementation of replace.py) and offers no benefit over -start:! yet. As a result I made a new patch to circumvent this problem.

Ok, thanks. And keep in mind that speed matters when you have to check the whole ns:0 monthly with hundreds of regexes. Many active bots on Wikipedia, I believe a good part of them, actually use the dumps. So the efficiency should be as good as possible... we are talking about days of CPU at 100%. Thanks for understanding!

Xqt triaged this task as High priority. Apr 14 2022, 4:03 PM

I see it is important to filter before processing the pages:

|  | XmlDumpReplacePageGenerator | old XMLDumpPageGenerator | new XMLDumpPageGenerator | -start:An option used |
|---|---|---|---|---|
| filtering | a filter is applied to each dump entry | no filtering before processing | no filtering before processing, but entry.text is not assigned | no filtering before processing |
| pages | 55697 entries processed, 12 pages found to process | 271 pages processed until first edit | 271 pages processed until first edit | 3625 pages processed until first edit |
| time | 1 second | 194 seconds | 57 seconds | 60 seconds |
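The "entry.text is not assigned" column can be illustrated with a small sketch: skipping the text element while streaming is what makes the new variant faster than the old one, since the bulky wikitext is never materialised. This is a stdlib-only approximation (entries, DUMP and the with_text flag are illustrative, not the actual pywikibot implementation):

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Tiny synthetic dump standing in for a multi-gigabyte real one.
DUMP = """<mediawiki>
<page><title>Arte</title><revision><text>long wikitext here</text></revision></page>
<page><title>Astronomia</title><revision><text>more wikitext here</text></revision></page>
</mediawiki>"""

def entries(xml_text, with_text=True):
    """Yield (title, text) for each dump page. With with_text=False the
    text element is never read, roughly the 'entry.text is not assigned'
    variant measured in the table above (a sketch, not pywikibot code)."""
    for _, el in ET.iterparse(StringIO(xml_text), events=('end',)):
        if el.tag == 'page':
            text = el.findtext('revision/text') if with_text else None
            yield el.findtext('title'), text
            el.clear()  # discard the subtree to keep memory flat

titles_only = [t for t, _ in entries(DUMP, with_text=False)]
# titles_only -> ['Arte', 'Astronomia']
```

With real dump sizes, not copying the wikitext of tens of thousands of non-matching pages is where the time difference comes from.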

Unfortunately the old XmlDumpReplacePageGenerator implementation raises an exception when parsing, whereas the pagegenerators implementation does not (maybe because that script was halted after the first 271 pages):

ERROR: ParseError: no element found: line 328465, column 355
Traceback (most recent call last):
  File "C:\pwb\GIT\core\pwb.py", line 496, in <module>
    main()
  File "C:\pwb\GIT\core\pwb.py", line 480, in main
    if not execute():
  File "C:\pwb\GIT\core\pwb.py", line 463, in execute
    run_python_file(filename, script_args, module)
  File "C:\pwb\GIT\core\pwb.py", line 143, in run_python_file
    exec(compile(source, filename, 'exec', dont_inherit=True),
  File ".\scripts\replace.py", line 1107, in <module>
    main()
  File ".\scripts\replace.py", line 1103, in main
    bot.run()
  File "C:\pwb\GIT\core\pywikibot\bot.py", line 1555, in run
    for item in self.generator:
  File "C:\pwb\GIT\core\pywikibot\pagegenerators.py", line 2240, in PreloadingGenerator
    for page in generator:
  File "C:\pwb\GIT\core\pywikibot\pagegenerators.py", line 1761, in <genexpr>
    return (page for page in generator if page.namespace() in namespaces)
  File ".\scripts\replace.py", line 435, in __iter__
    for entry in self.parser:
  File "C:\pwb\GIT\core\pywikibot\xmlreader.py", line 119, in parse
    for event, elem in context:
  File "C:\Python310\lib\xml\etree\ElementTree.py", line 1260, in iterator
    root = pullparser._close_and_return_root()
  File "C:\Python310\lib\xml\etree\ElementTree.py", line 1307, in _close_and_return_root
    root = self._parser.close()
xml.etree.ElementTree.ParseError: no element found: line 328465, column 355
CRITICAL: Exiting due to uncaught exception <class 'xml.etree.ElementTree.ParseError'>
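A truncated or corrupt dump makes ElementTree raise exactly this "no element found" ParseError when the parser is closed at EOF. One way to keep a long run from aborting is to catch the error after yielding everything that parsed cleanly. This is a hypothetical sketch of what xmlreader.XmlDump.parse() could do (parse_tolerant and TRUNCATED are illustrative names, not the actual fix in Gerrit change 782196):

```python
import xml.etree.ElementTree as ET
from io import StringIO

# A truncated dump: the closing </mediawiki> tag is missing, mimicking
# the possibly corrupt dump that crashed the run above.
TRUNCATED = '<mediawiki><page><title>Arte</title></page>'

def parse_tolerant(source):
    """Yield page titles, swallowing a trailing ParseError from a
    truncated file instead of aborting the whole run."""
    try:
        for _, elem in ET.iterparse(source, events=('end',)):
            if elem.tag == 'page':
                yield elem.findtext('title')
                elem.clear()
    except ET.ParseError as exc:
        # Everything parsed before the corruption point was already
        # yielded; report the error and stop cleanly.
        print('dump ended early:', exc)

titles = list(parse_tolerant(StringIO(TRUNCATED)))
# titles -> ['Arte']
```

Pages that were fully parsed before the corruption point are still processed, which matters when the error only appears hundreds of thousands of lines into the file.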

Change 780890 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] Revert "[IMPR] use pg.XMLDumpPageGenerator in replace.py"

https://gerrit.wikimedia.org/r/780890

@Basilicofresco: I've reverted the use of pagegenerators' XMLDumpPageGenerator to speed up replace.py. I think my dump is corrupt, but that ParseError should be fixed anyway.

Change 780890 merged by jenkins-bot:

[pywikibot/core@master] Revert "[IMPR] use pg.XMLDumpPageGenerator in replace.py"

https://gerrit.wikimedia.org/r/780890

Change 781565 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@stable] [7.1.1] Fix regression of XmlDumpPageGenerator

https://gerrit.wikimedia.org/r/781565

Change 781565 merged by jenkins-bot:

[pywikibot/core@stable] [7.1.1] Fix regression of XmlDumpPageGenerator

https://gerrit.wikimedia.org/r/781565

There are some open issues

Change 782196 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] handle ParserError within xmlreader.XmlDump.parse()

https://gerrit.wikimedia.org/r/782196

Xqt changed the subtype of this task from "Task" to "Bug Report". Apr 16 2022, 3:15 PM
Xqt added a project: Regression.

Change 780808 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] add -quiet option to omit message when no change was made

https://gerrit.wikimedia.org/r/780808

Change 780854 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] Deprecate XMLDumpOldPageGenerator in favour of a 'content' parameter

https://gerrit.wikimedia.org/r/780854

Change 782196 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] handle ParserError within xmlreader.XmlDump.parse()

https://gerrit.wikimedia.org/r/782196