
archivebot should ignore section headers within 'nowiki' segments (and commented out segments)
Closed, Resolved · Public

Description

Example:

https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&diff=270820733&oldid=270794820

In the example above, essentially, the first half of

== Openstreetmap uploads ==
(snip)
<nowiki>
== {{int:license-header}} ==
(snip)
</nowiki>

was archived, leaving

== {{int:license-header}} ==
(snip)
</nowiki>

which doesn't make sense. The bot should either remove all of the lines above or leave them all; it should not remove half of them.

Event Timeline

The problem seems to be caused by this match (in load_page() of archivebot.py), which is not aware of nowiki or comments:

thread_header = re.search('^== *([^=].*?) *== *$', line)

It should ignore <pre> blocks as well (and probably any extension tag block, though such blocks are unlikely to contain section header markup).
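
To illustrate the failure mode, here is a minimal sketch (with made-up page text) showing that a line-by-line scan with that regex also matches the header inside the <nowiki> block:

import re

page_text = """== Openstreetmap uploads ==
(snip)
<nowiki>
== {{int:license-header}} ==
(snip)
</nowiki>"""

# The line-by-line scan has no notion of the surrounding <nowiki>
# tags, so both '==' lines are reported as section headers.
for line in page_text.split('\n'):
    if re.search(r'^== *([^=].*?) *== *$', line):
        print(repr(line))
# prints both '== Openstreetmap uploads =='
# and '== {{int:license-header}} =='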

@Tacsipacsi Sure, the fix should cover all possible exceptions. I looked into the code and I think this fix is not as easy as expected.

Change 397803 had a related patch set uploaded (by Dvorapa; owner: Dvorapa):
[pywikibot/core@master] [bugfix] Make archivebot ignore headers in nowiki, pre, etc.

https://gerrit.wikimedia.org/r/397803

Dvorapa triaged this task as Medium priority. Dec 12 2017, 10:18 PM
Dvorapa raised the priority of this task from Medium to High. Dec 18 2017, 9:46 PM

Change 397803 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Make archivebot ignore headers in nowiki, pre, etc.

https://gerrit.wikimedia.org/r/397803

Now it sometimes misses sections. For example, hu:User talk:Wegyor’s sections from “Segítséget kértél, és én...” until “Családfa rajzokhoz...” (inclusive) are not discovered. I checked it with the following code:

class DiscussionPage(pywikibot.Page):
    ...
    def load_page(self):
        ...
        thread_titles = []  # debug: collect every detected section title
        for line_number, line in enumerate(lines, start=1):
            if line_number in thread_headers:
                thread_header = re.search('^== *([^=].*?) *== *$', line)
                thread_titles.append(thread_header.group(1))  # debug
                found = True  # Reading threads now
                if cur_thread:
                    self.threads.append(cur_thread)
                cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                              self.timestripper)
            else:
                if found:
                    cur_thread.feed_line(line)
                else:
                    self.header += line + '\n'
        if cur_thread:
            self.threads.append(cur_thread)
        pywikibot.output('Thread titles: %s' % thread_titles)  # debug
        ...

The first section cannot be archived because it doesn't have a timestamp. Manually archive the sections that have no timestamp, or add timestamps to them, and try again. If there is a new bug, please file a new task.

The first section doesn’t have a timestamp, so it won’t be archived, but it is still a section. The above code (which is the upstream code with three additional lines for debugging) doesn’t check whether a section has a timestamp or not. And the other two sections do have timestamps. It’s a regression introduced by this patch, so it seemed obvious to me to reopen this task for it.

For me, the page hu:User talk:Wegyor works as expected. You can see its history. Have you got an up-to-date pywikibot with its dependencies installed?

BTW the code you checked is only half of the code responsible for thread header recognition...

Of course, the actual archive bot works as expected, as it doesn’t use the latest code (it uses a customized version of the script until T72249 is fixed). Your test also seemed to work as expected, as the three missing sections were part of “Saját magamnak: ilyen volt”, which, concatenated with the other three, contained timestamps. What is the other half, which I haven’t tested? If you’re speaking about lines 471–474, it uses the same regex and even fewer input lines, so, what a surprise, it gives the same result.

I don't understand what is wrong. For me, everything works with the pages you listed. The whole code to test (you have to import _get_regexes too):

def load_page(self):
    """Load the page to be archived and break it up into threads."""
    self.header = ''
    self.threads = []
    self.archives = {}
    self.archived_threads = 0
    text = self.get()
    # Replace text in following exceptions by spaces, but don't change line
    # numbers
    exceptions = ['comment', 'code', 'pre', 'source', 'nowiki']
    exc_regexes = _get_regexes(exceptions, self.site)
    stripped_text = text
    for regex in exc_regexes:
        for match in re.finditer(regex, stripped_text):
            before = stripped_text[:match.start()]
            restricted = stripped_text[match.start():match.end()]
            after = stripped_text[match.end():]
            restricted = re.sub(r'[^\n]', r'', restricted)
            stripped_text = before + restricted + after
    # Find thread headers in stripped text and return their line numbers
    stripped_lines = stripped_text.split('\n')
    thread_headers = []
    for line_number, line in enumerate(stripped_lines, start=1):
        if re.search(r'^== *[^=].*? *== *$', line):
            thread_headers.append(line_number)
    # Fill self by original thread headers on returned line numbers
    lines = text.split('\n')
    found = False  # Reading header
    cur_thread = None
    for line_number, line in enumerate(lines, start=1):
        if line_number in thread_headers:
            thread_header = re.search('^== *([^=].*?) *== *$', line)
            found = True  # Reading threads now
            if cur_thread:
                self.threads.append(cur_thread)
            cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                          self.timestripper)
        else:
            if found:
                cur_thread.feed_line(line)
            else:
                self.header += line + '\n'
    if cur_thread:
        self.threads.append(cur_thread)
    # This extra info is not desirable when run under the unittest
    # framework, which may be run either directly or via setup.py
    if pywikibot.calledModuleName() not in ['archivebot_tests', 'setup']:
        pywikibot.output(u'%d Threads found on %s'
                         % (len(self.threads), self))

Latest pywikibot commit I've tested this on (with no modifications):
24617f35d693b65e4d9bf14755dae8af835390ed

Command I've used for testing:

$ python pwb.py archivebot -lang:hu User:Cherybot/config -page:"..." -force -user:Dvorapa
Processing [[hu:...]]
xy Threads found on [[hu:...]]
Looking for: {{Szerkesztő:Cherybot/config}} in [[hu:...]]
Processing xy threads
Archiving yx thread(s).
Page [[...]] saved

If you use a customized version, please make sure you ported the whole new functionality into your code (you can compare what changed in https://gerrit.wikimedia.org/r/#/c/397803/11/scripts/archivebot.py):

New functionality you should port:
...
from pywikibot.textlib import _get_regexes
...

...
        text = self.get()
        # Replace text in following exceptions by spaces, but don't change line
        # numbers
        exceptions = ['comment', 'code', 'pre', 'source', 'nowiki']
        exc_regexes = _get_regexes(exceptions, self.site)
        stripped_text = text
        for regex in exc_regexes:
            for match in re.finditer(regex, stripped_text):
                before = stripped_text[:match.start()]
                restricted = stripped_text[match.start():match.end()]
                after = stripped_text[match.end():]
                restricted = re.sub(r'[^\n]', r'', restricted)
                stripped_text = before + restricted + after
        # Find thread headers in stripped text and return their line numbers
        stripped_lines = stripped_text.split('\n')
        thread_headers = []
        for line_number, line in enumerate(stripped_lines, start=1):
            if re.search(r'^== *[^=].*? *== *$', line):
                thread_headers.append(line_number)
        # Fill self by original thread headers on returned line numbers
        lines = text.split('\n')
        found = False  # Reading header
        cur_thread = None
        for line_number, line in enumerate(lines, start=1):
            if line_number in thread_headers:
                thread_header = re.search('^== *([^=].*?) *== *$', line)
                found = True  # Reading threads now
                if cur_thread:
                    self.threads.append(cur_thread)
                cur_thread = DiscussionThread(thread_header.group(1), self.now,
                                              self.timestripper)
            else:
                if found:
                    cur_thread.feed_line(line)
                else:
                    self.header += line + '\n'
...

And if your customized version fixes T72249, please help pywikibot and submit a patch.

If you’re speaking about lines 471–474

You should test the whole part (from line 456).

Please provide an example of unexpected behavior that I could test on a clean, up-to-date pywikibot installation. If there is still something wrong that I'm missing, please file a new task and include:

  • version of python and pwb
  • steps to reproduce
    • code to test (if different from the original)
    • page where wrong behavior occurs
    • command you used

I used the current master code (with some lines added to print section titles) for testing, so I didn’t have to port anything, and ran it with the -simulate option so that the whole page would not be archived if it shouldn’t be. The production bot uses a customized (old) version, in which {{függőben}} (“pending”) is hardcoded to prevent archiving (I can’t just submit a patch because of the hardcoding).

Change 409321 had a related patch set uploaded (by Whym; owner: Whym):
[pywikibot/core@master] archivebot: count removed characters when excluding comments etc

https://gerrit.wikimedia.org/r/409321

Change 409321 merged by jenkins-bot:
[pywikibot/core@master] archivebot: count removed characters when excluding comments etc

https://gerrit.wikimedia.org/r/409321
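
For context, the bug this second patch addresses: the merged loop above takes match positions from finditer() on one string but applies them to a string it is simultaneously shortening, so every character it removes shifts the later matches out of place. A minimal sketch of a fix along the lines of the patch title, counting the removed characters and shifting each match's offsets accordingly, could look like this (hypothetical helper, not the merged code verbatim):

import re

def blank_exceptions(text, exc_regexes):
    """Blank out nowiki/comment/etc. regions while keeping line numbers."""
    stripped_text = text
    for regex in exc_regexes:
        snapshot = stripped_text  # finditer() runs on this fixed string
        removed = 0  # characters removed so far in this pass
        for match in re.finditer(regex, snapshot):
            # Shift the match positions by the number of characters
            # already removed, so they stay valid in the shortened text.
            start = match.start() - removed
            end = match.end() - removed
            restricted = stripped_text[start:end]
            blanked = re.sub(r'[^\n]', '', restricted)  # keep only newlines
            removed += len(restricted) - len(blanked)
            stripped_text = (stripped_text[:start] + blanked
                             + stripped_text[end:])
    return stripped_text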