Page MenuHomePhabricator

reimplement imageuncat's -recentchanges as a pagegenerators command line argument
Closed, ResolvedPublic

Description

Similar to T67192, but for -recentchanges and its recentChanges generator.

While this is not a one-liner, it is marked as good first task because T67192 provides a good example of the changes, and it is quite easy to test.

Event Timeline

@jayvdb
There is already an existing -recentchanges:x command line argument in pagegenerators, which returns the x most recently changed pages (if x not mentioned, then x=60), while the function of -recentchanges in imageuncat is to return pages changed between 70 and 120 minutes ago.

Would you like me to implement a new command line argument in pagegenerators (maybe -recentchanges-timespan or -recentchanges:timespan or something similar) for imageuncat's functionailty?

Or I could modify the existing -recentchanges in different ways:

  • Accept -recentchanges:x-y or -recentchanges-timespan:x-y such that x is the delay, and y is the block, and both are positive (need to think of error messages if they aren't positive )
  • Accept -recentchanges:timespan:x where x is the delay and the block is kept as the default 70 (x is a positive int)
  • Accept -recentchanges:timespan:x where x is the block and the timespan is kept as the default 120 (x is a positive int)

This task is about improving pagegenerators' -recentchanges . Please see T67192 and the code used to fix it.

This task is about improving pagegenerators' -recentchanges . Please see T67192 and the code used to fix it.

I did go through that task and the corresponding code. :)
My doubt was merely if you wanted a modification of the existing parameter, or the creation of a new one, since the task's wording is 'reimplement'.

Since you have now mentioned 'improving', and since the quoted task also involves modification of an existing parameter, I take it that we have to modify -recentchanges such that it accepts both -recentchanges:x and -recentchanges:delay,block.

I will go ahead with that. Please correct me if I am wrong?
Thank you. :)

Improving the existing pagegenerators -recentchanges seems the best approach

Change 276535 had a related patch set uploaded (by Darthbhyrava):
Change: Reimplement imageuncat's -recentchanges as a pagegenerators CL arg

https://gerrit.wikimedia.org/r/276535

I've added a patch, but am not sure about the tests. Could someone please take a look?

pagegenerators_tests throws the following error:

$ nosetests -v tests.pagegenerators_tests

ERROR: Test recentchanges generator with offset and duration params.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/tests/pagegenerators_tests.py", line 809, in test_recentchanges_timespan
    gf.handleArg('-recentchanges:120,70')
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/pywikibot/pagegenerators.py", line 704, in handleArg
    _filter_unique=self._filter_unique)
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/pywikibot/tools/__init__.py", line 1358, in wrapper
    return obj(*__args, **__kw)
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/pywikibot/pagegenerators.py", line 1144, in RecentChangesPageGenerator
    user=user, excludeuser=excludeuser)
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/pywikibot/tools/__init__.py", line 1358, in wrapper
    return obj(*__args, **__kw)
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/pywikibot/site.py", line 4368, in recentchanges
    self.assert_valid_iter_params('recentchanges', start, end, reverse)
  File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/pywikibot/site.py", line 2240, in assert_valid_iter_params
    "%s: start must be later than end with reverse=False" % msg_prefix)
Error: recentchanges: start must be later than end with reverse=False

Should I swap the rcstart and rcend values in my edit of pagegenerators or is problem somewhere else?

Yes, or set the reverse parameter accordingly.

Yes, or set the reverse parameter accordingly.

Setting reverse to True seems more appropriate. I'll do that.
Thanks!

Though both recentChangesin imageuncat and RecentChangesPageGenerator in pagegenerators call site.recentchanges (here's a paste listing out both), the differences seem to be as follows:

gen.request['rcprop'] = 'title|user|comment|ids'
    for p in gen:
        yield pywikibot.Page(site, p['title'], p['ns'])
gen = (pywikibot.Page(site, x['title'])
          for x in gen if x['type'] != 'log' or 'title' in x)
  • gen.request['rcprop'] and the extra p['ns'] in recentChanges
  • the conditions on x in RecentChangesPageGenerator

If the above can be resolved, then I wouldn't need to have an extra from_imageuncat param.

I think that p['ns'] should not be an issue.

Actually I do not understand why the docstring in RecentChangesPageGenerator says:

@param changetype: only iterate changes of this type ("edit" for
        edits to existing pages, "new" for new pages, "log" for log
        entries)

while, then, log is excluded.

Wrong docstring or clause?

I think that p['ns'] should not be an issue.

Actually I do not understand why the docstring in RecentChangesPageGenerator says:

@param changetype: only iterate changes of this type ("edit" for
        edits to existing pages, "new" for new pages, "log" for log
        entries)

while, then, log is excluded.

Wrong docstring or clause?

We already have -logevents to deal with recent log entries, that is probably why it is excluded. I'll look into it, and confirm.

About the p['ns'] - should I include it or exclude it when merging with imageuncat's code?
And the issue of gen.request ['rcprop'] remains. What can be done about it if we merge the code without the from_imageuncat param?
@jayvdb?

I think that p['ns'] should not be an issue.

Agree.

Actually I do not understand why the docstring in RecentChangesPageGenerator says:

@param changetype: only iterate changes of this type ("edit" for
        edits to existing pages, "new" for new pages, "log" for log
        entries)

while, then, log is excluded.

Wrong docstring or clause?

Regarding for x in gen if x['type'] != 'log' or 'title' in x

This is to avoid trying to generate a Page for a log entry where the title has been suppressed. If the log entry has a title, it will be generated.

This is a feature of RecentChangesPageGenerator, and it is a bug in imageuncat's recentChanges. git blame shows it was slightly re-arranged in fc70846a, and the original bug fix is 996fa7e6.

So we dont want to re-implement this bug in imageuncat. (And this is a great example of why using a unified implementation in pagegenerators is a good idea: bug fixes in pagegenerators benefit all scripts that use it, and the bug fix is unlikely to be copied to separate implementations in scripts).

  • gen.request['rcprop'] and the extra p['ns'] in recentChanges

The default rcprop value is documented on mw:API:RecentChanges as timestamp|title|ids, so imageuncat value of title|user|comment|ids has two additional props : user and comment. They are explained on the aforementioned API manual page.

recentChanges fetches these additional properties, but then it only uses p['title'] and p['ns'] , so it is not using the values that prop comment or user provide.

So why is it there? It was added in September 2009 by 7654b64b , with a vague commit message referring to @Multichill who might know more about why it was added. However a little git blame analysis allows us to deduce why the change happened.

In September 2009, APISite.recentchanges was using rcprop="user|comment|timestamp|title|ids|redirect|patrolled|loginfo|flags" (it is only slightly different in the current 2016 code), so that commit was reducing the number of rc props being used: it removed timestamp and redirect|patrolled|loginfo|flags. A comment in that commit says:

  1. remove 'patrolled' from rcprop since we can't get it

So it seems the main objective of the patch was to remove patrolled from the list of requested props. It also avoided requesting unnecessary/unused properties. user and comment are two additional unused properties that could have removed been removed in that changeset, but were not. They are not needed.

Now when we look at RecentChangesPageGenerator, we see it is using APISite.recentchanges, and APISite.recentchanges does not request the patrolled prop. So, the main objective of this line in imageuncat's recentChanges appears to have been solved in APISite.recentchanges. Therefore, imageuncat should be able to use RecentChangesPageGenerator without any problem without a modified rcprop.
Could you please find the commit which removed the patrolled prop from APISite.recentchanges?

However, RecentChangesPageGenerator is requesting many other props which are unnecessary as they are not used. RecentChangesPageGenerator` only uses title and type -- everything else is discarded. So there is a performance improvement which can be made to RecentChangesPageGenerator, by setting the rcprop to only include the props which are used. This should be a separate (new) changeset.

That was certainly very instructive. Thanks. :D

Looking up APISite.recentchanges, I can see the code you speak of mentioning rcprop in recentchanges:

rcgen = self._generator(api.ListGenerator, type_arg="recentchanges",
                                rcprop="user|comment|timestamp|title|ids"
                                       "|sizes|redirect|loginfo|flags",
                                namespaces=namespaces, step=step,
                                total=total)

Interestingly, the September '09 change by Alex removed 'patrolled' because of denied permission. (Whitelist issues, maybe?) Anyway, here's the commit you wanted. :)

gen = (pywikibot.Page(site, x['title'])
          for x in gen if x['type'] != 'log' or 'title' in x)

So, since we don't want to re-implement the non-titled entries bug in imageuncat, that's the issue resolved here.

gen.request['rcprop'] = 'title|user|comment|ids'
    for p in gen:
        yield pywikibot.Page(site, p['title'], p['ns'])

And since we just saw how imageuncat doesn't, in fact, call the extra props, a non-modified rcprop will do. That's this issue solved, too. :D

I'll submit another patch with these changes in mind, then.

However, RecentChangesPageGenerator is requesting many other props which are unnecessary as they are not used. RecentChangesPageGenerator` only uses title and type -- everything else is discarded. So there is a performance improvement which can be made to RecentChangesPageGenerator, by setting the rcprop to only include the props which are used. This should be a separate (new) changeset.

I've created a new task T130189 for this changeset, which can be a blocking task for this one. Description literally being your words. :D Take a look?

I'm trying to understand the difference between @deprecated and @deprecated() decorators, as defined in tools, and I didn't quite understand what 'callable' in the description of @deprecated() meant:

# The latter is invoked with the decorator arguments as *args & **kwargs,
# and it must return a callable which will be invoked with the decorated
# function as args[0].

I've gone with: @deprecated('RecentChangesPagesGenerator') to deprecate recentChanges in -imageuncat.

I've submitted a patch for the blocking task as well as updated patch for this one, please take a look? @jayvdb

Change 276535 merged by jenkins-bot:
Reimplement imageuncat's -recentchanges as a pagegenerators CLA

https://gerrit.wikimedia.org/r/276535