
pagegenerators.py -match option
Open, Medium, Public

Description

Hello! It would be nice if you added a -match option to pagegenerators.py, so that the script works only on pages that match some regexp.

Details

Reference
bz55078

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 2:18 AM
bzimport set Reference to bz55078.
bzimport added a subscriber: Unknown Object (????).
Legoktm created this task. Oct 5 2013, 4:20 AM

How would this be different than the "-regex" option that already exists?

  • status: open --> pending

-regex is used for replacement. How can we solve a task such as "replace text using some regex if the article matches some regex"?

  • status: pending --> open

Do you expect something like -requiretext:XYZ (in combination with -regex), which could be the opposite of -excepttext:XYZ, analogous to the existing -requiretitle vs -excepttitle?

Yep, something like that.
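
The behaviour being asked for can be sketched in plain Python. This is only an illustration of the requested semantics, not Pywikibot code: `conditional_replace` and its `require` / `exclude` parameters are hypothetical names chosen for this example, and pages are modelled as a plain dict of title to text.

```python
import re


def conditional_replace(pages, find, repl, require=None, exclude=None):
    """Apply find -> repl only to pages that pass the text conditions.

    pages   -- dict mapping title to wikitext (stand-in for real pages)
    require -- hypothetical: regex the text must match (cf. -requiretext)
    exclude -- hypothetical: regex the text must not match (cf. -excepttext)
    """
    changed = {}
    for title, text in pages.items():
        if require is not None and not re.search(require, text):
            continue  # condition regex not found: leave the page alone
        if exclude is not None and re.search(exclude, text):
            continue  # page is excluded by its content
        new_text = re.sub(find, repl, text)
        if new_text != text:
            changed[title] = new_text
    return changed


pages = {
    "A": "{{Infobox}} colour and flavour",
    "B": "colour without an infobox",
}
# Replace "colour" only on pages whose text contains an infobox.
result = conditional_replace(pages, r"colour", "color", require=r"\{\{Infobox")
```

Here only page "A" is changed, because "B" does not match the condition regex even though it matches the replacement regex.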

jayvdb triaged this task as Medium priority. Jun 9 2015, 5:59 AM
jayvdb edited projects, added Pywikibot-replace.py, Pywikibot; removed Pywikibot-Scripts.
jayvdb set Security to None.
Omegat added a subscriber: Omegat. Jun 22 2015, 8:25 AM
binbot added a subscriber: binbot. Sep 4 2016, 12:32 PM

I know a workaround for this in compat, but that feature has not been ported to core yet. Hopefully will be soon...

(First round: use replace.py "someregex" "foobar" -save:something.txt, then do the actual replacements with -file:something.txt.)

As far as I can see, the -grep option provides this:

-grep             A regular expression that needs to match the article
                  otherwise the page won't be returned.
                  Multiple -grep:regexpr can be provided and the page will
                  be returned if content is matched by any of the regexpr
                  provided.
                  Case insensitive regular expressions will be used and
                  dot matches any character, including a newline.
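
The semantics described in this help text can be sketched without Pywikibot. The function name `grep_filter` and the (title, text) tuples are assumptions made for this example; only the matching rules (any pattern may match, case-insensitive, dot matches newline) come from the help text above.

```python
import re


def grep_filter(pages, patterns):
    """Yield (title, text) pairs whose text matches any of the patterns."""
    # Case-insensitive, and "." also matches newlines, as the help text says.
    compiled = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in patterns]
    for title, text in pages:
        if any(rx.search(text) for rx in compiled):
            yield title, text


pages = [
    ("A", "First line\nSecond LINE"),
    ("B", "nothing relevant here"),
    ("C", "a stub"),
]
# Two patterns: a page is kept if either one matches its text.
matched = [title for title, _ in grep_filter(pages, [r"first.*second", r"STUB"])]
```

Page "A" is kept because the dot spans the newline, and page "C" because matching ignores case; page "B" matches neither pattern and is dropped.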
Dvorapa added a subscriber: Dvorapa. Edited Jun 10 2017, 5:09 PM

  1. -grep matches regex in page title
  2. -search:'insource://' matches regex inside page content. This is not ideal (doesn't work with ^,$,\s,...), but it usually will do just fine

Why do we need not ideal solutions?

Dvorapa added a comment. Edited Jun 10 2017, 6:21 PM

-search:'insource://' is just a workaround for this issue. It works with MediaWiki's CirrusSearch, which does not support some regex operators (e.g. T135280)

OK. Replace.py is very important to me, as I use it heavily for multiple purposes, and I am highly interested in its performance. I contributed a lot to the compat version, but I now have trouble both using core and getting back into development. I will look into the problem when I am able.

  1. -grep matches regex in page title

No, -grep matches the page contents. -titleregex matches the page title. This is clearly documented (https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/pagegenerators.py#L307, https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/pagegenerators.py#L226), and corresponds (as far as I can see) to the actual code.

Dvorapa added a comment. Edited Jun 10 2017, 6:30 PM

That's weird; for me -grep never worked on page contents at all. I'll try once more next time.

Update: it finally works, wow, but a generator is still missing.

Dvorapa added a comment. Edited Jun 10 2017, 6:38 PM

But maybe the more important difference:

  1. -search:'insource://' is a generator
  2. -grep is a filter

Because -search:'insource://' is not ideal (missing support for \s, \n, ^, $, ...), a solution to this task would still be helpful.
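
The generator/filter distinction can be illustrated with a small plain-Python sketch. Everything here is a stand-in: `WIKI`, `search_generator` and `content_filter` are hypothetical names, and a substring test replaces the real search backend. The point is only that a generator produces candidate pages by itself, while a filter needs some other generator to feed it and must look at the text of every candidate.

```python
import re

WIKI = {  # stand-in for a wiki: title -> page text
    "A": "foo\n  bar",
    "B": "foobar",
    "C": "baz",
}


def search_generator(query):
    """Generator: produces candidate titles by itself (cf. -search)."""
    # Assumption: a plain substring test stands in for the search backend.
    for title, text in WIKI.items():
        if query in text:
            yield title


def content_filter(titles, regex):
    """Filter: consumes another generator and drops non-matching pages."""
    rx = re.compile(regex, re.MULTILINE)
    for title in titles:
        if rx.search(WIKI[title]):  # needs the page text of every candidate
            yield title


candidates = list(search_generator("foo"))
# Anchors and \s, which the search step cannot express, are applied afterwards.
refined = list(content_filter(search_generator("foo"), r"^\s+bar$"))
```

In this toy setup the search step narrows the wiki to "A" and "B", and the filter then keeps only "A". A filter on its own has nothing to iterate over, which is why a content-matching generator is still being requested here.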

Dvorapa renamed this task from replace.py -match option to pagegenerators.py -match option. Jun 10 2017, 6:40 PM
Dvorapa updated the task description.