pagegenerators.py -match option
Closed, Declined (Public)

Assigned To

None

Authored By

	Legoktm
	Oct 5 2013, 4:20 AM

Description

Hello! It would be nice if you added a -match option to pagegenerators.py, so that a script works only on pages that match some regexp.

Details

Reference: bz55078

Related Objects

Mentioned Here: T135280: insource: queries can't search for the return character
T144692: Port replace.py saving options from compat

Event Timeline

• bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:18 AM

• bzimport added a project: Pywikibot-Scripts.

• bzimport set Reference to bz55078.

• bzimport added a subscriber: Unknown Object (????).

Legoktm created this task. Oct 5 2013, 4:20 AM

How would this be different than the "-regex" option that already exists?

status: open --> pending

-regex is used for replacement. How can we solve a task such as "replace text using some regex only if the article matches some other regex"?

status: pending --> open

Do you expect something like -requiretext:XYZ (in combination with -regex), which would be the opposite of -excepttext:XYZ, analogous to the existing -requiretitle vs. -excepttitle?

Yep, something like that.

jayvdb triaged this task as Medium priority. Jun 9 2015, 5:59 AM

jayvdb edited projects, added Pywikibot-replace.py, Pywikibot; removed Pywikibot-Scripts.

jayvdb set Security to None.

Omegat subscribed. Jun 22 2015, 8:25 AM

I know a workaround for this in compat, but that feature has not been ported to core yet. Hopefully will be soon...

(First round: use replace.py "someregex" "foobar" -save:something.txt, then do the actual replacements with -file:something.txt.)

See also T144692.
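The two-pass idea can be sketched in plain Python. This is only an illustration of the workflow, not pywikibot code: the in-memory `PAGES` dict and the helpers `first_pass` and `second_pass` are hypothetical stand-ins, and the first pass here simply records matching titles rather than reproducing what -save does in compat.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical in-memory stand-in for wiki pages: title -> wikitext.
PAGES = {
    "Alpha": "This page mentions someregex once.",
    "Beta": "Nothing of interest here.",
    "Gamma": "someregex appears here, along with old text.",
}


def first_pass(pages, pattern, list_path):
    """Write the titles of pages whose text matches pattern, one per line."""
    regex = re.compile(pattern)
    titles = [title for title, text in pages.items() if regex.search(text)]
    list_path.write_text("\n".join(titles), encoding="utf-8")
    return titles


def second_pass(pages, list_path, old, new):
    """Apply the real replacement only to the titles listed in the file."""
    regex = re.compile(old)
    changed = {}
    for title in list_path.read_text(encoding="utf-8").splitlines():
        changed[title] = regex.sub(new, pages[title])
    return changed


with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "something.txt"
    # Pass 1: select pages by one regex and save their titles.
    selected = first_pass(PAGES, r"someregex", list_file)
    # Pass 2: run a different replacement on just those pages.
    result = second_pass(PAGES, list_file, r"old text", "new text")
```

The title list on disk is what decouples the matching regex from the replacement regex, which is the gap the requested option was meant to close.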

As far as I can see, the -grep option provides this:

-grep             A regular expression that needs to match the article
                  otherwise the page won't be returned.
                  Multiple -grep:regexpr can be provided and the page will
                  be returned if content is matched by any of the regexpr
                  provided.
                  Case insensitive regular expressions will be used and
                  dot matches any character, including a newline.
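For readers unfamiliar with those semantics, here is a minimal sketch of a filter that behaves as the help text describes: case-insensitive, dot matches newlines, and a page passes if any pattern matches. It is not the pywikibot implementation; `grep_filter` and the `(title, text)` tuples are assumptions made for illustration.

```python
import re


def grep_filter(pages, patterns):
    """Yield (title, text) pairs whose text matches any of the patterns."""
    # IGNORECASE and DOTALL mirror the behaviour described in the help text.
    compiled = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in patterns]
    for title, text in pages:
        if any(rx.search(text) for rx in compiled):
            yield title, text


sample = [
    ("A", "Infobox start\nsome lines\nInfobox END"),
    ("B", "plain text"),
    ("C", "See also: FOO"),
]

# "start.*end" spans the newlines in page A; "foo" matches "FOO" in page C.
matched = [title for title, _ in grep_filter(sample, [r"start.*end", r"foo"])]
```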

In T57078#2607540, @valhallasw wrote:

As far as I can see, the -grep option provides this:

-grep             A regular expression that needs to match the article
                  otherwise the page won't be returned.
                  Multiple -grep:regexpr can be provided and the page will
                  be returned if content is matched by any of the regexpr
                  provided.
                  Case insensitive regular expressions will be used and
                  dot matches any character, including a newline.

-grep matches regex in ~~page title~~
-search:'insource://' matches regex inside page content. This is not ideal (doesn't work with ^,$,\s,...), but it usually will do just fine

In T57078#3337396, @Dvorapa wrote:

-search:'insource://' matches regex inside page content. This is not ideal (doesn't work with ^,$,\s,...), but it usually will do just fine

Why do we need non-ideal solutions?

In T57078#3337467, @binbot wrote:

Why do we need non-ideal solutions?

-search:'insource://' is just a workaround for this issue. It works with MediaWiki's CirrusSearch, which does not support some regex operators (e.g. T135280)

OK. Replace.py is very important to me, as I use it heavily for multiple purposes, and I am highly interested in its performance. I contributed a lot to the compat version, but now I have trouble both with using core and with coming back to development. I will look into the problem when I am able.

In T57078#3337396, @Dvorapa wrote:

-grep matches regex in page title

No, -grep matches the page contents. -titleregex matches the page title. This is clearly documented (https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/pagegenerators.py#L307, https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/pagegenerators.py#L226), and corresponds (as far as I can see) to the actual code.

In T57078#3337475, @valhallasw wrote:

In T57078#3337396, @Dvorapa wrote:

-grep matches regex in page title

No, -grep matches the page contents. -titleregex matches the page title. This is clearly documented (https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/pagegenerators.py#L307, https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/pagegenerators.py#L226), and corresponds (as far as I can see) to the actual code.

That's weird, for me -grep never worked for page contents at all. I'll try once more next time.

Update: It finally works, wow, but a generator is still missing.

But maybe the more important difference:

-search:'insource://' is a generator
-grep is a filter

Because -search:'insource://' is not ideal (missing support for \s, \n, ^, $, ...), a solution to this task would still be helpful.
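The generator/filter distinction can be illustrated with a small sketch, assuming a toy in-memory wiki. `start_generator` and `content_filter` are hypothetical names, not pywikibot functions. The generator is the source of pages; the filter can only narrow whatever the generator hands it, so every page from the source has to be fetched and inspected locally.

```python
import re

# Hypothetical toy wiki: (title, text) pairs.
WIKI = [
    ("Apple", "red fruit\n"),
    ("Banana", "yellow fruit"),
    ("Carrot", "orange root"),
]

fetched = []  # records every page the source had to hand over


def start_generator(wiki, start_title):
    """Source: yield pages in title order, beginning at start_title."""
    for title, text in sorted(wiki):
        if title >= start_title:
            fetched.append(title)
            yield title, text


def content_filter(pages, pattern):
    """Filter: pass through only pages whose text matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE | re.DOTALL)
    return (page for page in pages if regex.search(page[1]))


# A local filter supports the full regex syntax, including \s and $.
titles = [t for t, _ in content_filter(start_generator(WIKI, "B"), r"fruit\s*$")]
```

In this sketch the filter returns one page but the source still had to yield two, which is the cost a server-side generator would avoid.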

Dvorapa renamed this task from replace.py -match option to pagegenerators.py -match option. Jun 10 2017, 6:40 PM

Dvorapa edited projects, added Pywikibot-pagegenerators.py; removed Pywikibot-replace.py.

Dvorapa updated the task description.

We already have a -grep filter. It works together with any generator, e.g. with -start. There is no such -match filter on the API side that could be used.
