Port replace.py saving options from compat
Open, High, Public

Description

In compat, replace.py has these options: -save, -savenew, -savexc, -savexcnew.
With -save / -savenew the bot saves titles to a file rather than making replacements. This makes a "two-run workflow" possible: first collect titles automatically, which may be slow but needs no manual intervention, then work quickly on the saved results.

With -savexc / -savexcnew we get a new choice "x": "Don't replace, and save the title to exceptions". The resulting exceptions can be copied from the file into the exceptions dictionary of the fix for later use, so next time you don't have to go through the same pages again and waste your time.
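As an illustration of that copy step, here is a minimal sketch that turns a saved exceptions file into a list of title patterns. It assumes the file holds one bare page title per line (the actual compat output format may differ) and that the entries under the 'title' key of a fix's exceptions dictionary are treated as regular expressions. The name load_title_exceptions is hypothetical.

```python
import io
import re


def load_title_exceptions(filename):
    """Return anchored, regex-escaped title patterns from a saved file.

    Assumption: the file contains one bare page title per line.
    """
    patterns = []
    with io.open(filename, encoding='utf-8') as f:
        for line in f:
            title = line.strip()
            if title:
                # Escape regex metacharacters and anchor the pattern so
                # that it only matches this exact title.
                patterns.append('^' + re.escape(title) + '$')
    return patterns
```

The returned list could then be pasted (or assigned) into the fix, for example under 'exceptions': {'title': [...]}, if that is where your fix keeps title exceptions.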

For convenience, compat keeps a counter for both cases and writes the number of saved titles to the console upon finishing.
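The saving behaviour described above can be sketched roughly as follows. This is not the compat code, only a minimal illustration; the class name TitleSaver and its parameters are hypothetical, and it assumes that -savenew starts a fresh file while -save appends to an existing one.

```python
import io


class TitleSaver:
    """Write page titles to a UTF-8 file, flushing after every title."""

    def __init__(self, filename, overwrite=False):
        # Assumption: overwrite=True corresponds to -savenew (start with
        # an empty file), overwrite=False to -save (append).
        mode = 'w' if overwrite else 'a'
        self._file = io.open(filename, mode, encoding='utf-8')
        self.count = 0

    def save(self, title):
        """Write one title and flush it to the file immediately."""
        self._file.write(title + u'\n')
        # Flushing after each title means earlier titles are already in
        # the file if the run is interrupted later.
        self._file.flush()
        self.count += 1

    def close(self):
        self._file.close()
```

At the end of a run, the bot would print saver.count to the console and call saver.close().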

Without these features the core version is useless for high-volume, repeated work. See also T99365.

Event Timeline

binbot created this task. Sep 4 2016, 12:58 PM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Sep 4 2016, 12:58 PM

I've merged T57689: Implement -save option on replace.py core here because this is a more general task (not just -save).

Questions:

  • How much of this workflow can be supported by using listpages.py -grep ?
  • Do we need both -save and -savenew, or can we just output the list to stdout, and let the user redirect using > or >>?
  • Same for -savexc.
  • -save does not seem to save page titles if -always is passed. Is that on purpose?
binbot added a comment (edited). Sep 4 2016, 2:19 PM

I don't know listpages as I always used compat.

-save/-savenew (with flush() after every save) works reliably even for very slow, long search tasks (I mean up to several days!). Even if there is a crash somewhere, or I have to disconnect my computer from the network, the result is there, and I can continue later with -xmlstart from the last saved title (or whatever the script suggested when I pressed Ctrl-C).
I suppose this is not the case with > (and it may still be operating-system dependent).

What is more, if it takes too long, I can work in parallel. When the bot has saved 300 titles and is still working on the next 3000, I can process the saved titles from another command window. This is much more flexible than > / >>.

I extensively use both save and savenew.

-savexc is different: its file is created during the work. To use > you would need two output streams (one for console output, one for exceptions), and all data might be lost upon unexpected termination.

In compat, -save works fine for me both with and without -always.

I don't know listpages as I always used compat.

Then could you please set up core and test whether listpages solves this use case?

-save/-savenew (with flush() after every save) works reliably even for very slow, long search tasks (I mean up to several days!). Even if there is a crash somewhere, or I have to disconnect my computer from the network, the result is there, and I can continue later with -xmlstart from the last saved title (or whatever the script suggested when I pressed Ctrl-C).
I suppose this is not the case with > (and it may still be operating-system dependent).

Why would this not be possible with >?

What is more, if it takes too long, I can work in parallel. When the bot has saved 300 titles and is still working on the next 3000, I can process the saved titles from another command window. This is much more flexible than > / >>.

stdout can also be flushed.

-savexc is different: its file is created during the work. To use > you would need two output streams (one for console output, one for exceptions), and all data might be lost upon unexpected termination.

We already split output (user output goes to stderr, pipe-able output goes to stdout), so I'm unsure why this would not work, nor why there would be data loss in some cases?
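A minimal sketch of that split, using plain Python streams rather than the actual Pywikibot output functions (the helper names emit_title and emit_message are hypothetical):

```python
import sys


def emit_title(title, out=sys.stdout):
    """Write one page title to the pipe-able stream and flush it."""
    out.write(title + '\n')
    # An explicit flush pushes the title through a > or >> redirect
    # right away instead of leaving it in the stream buffer.
    out.flush()


def emit_message(message, err=sys.stderr):
    """Write a user-facing message to the console stream."""
    err.write(message + '\n')
    err.flush()
```

With such a split, redirecting stdout with >> would collect only the titles, while progress messages stay on the console. Whether the redirected file ends up in the intended encoding depends on the environment, which this sketch does not address.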

In compat, -save works fine for me both with and without -always.

Never mind, I misread the code. It's indeed saved.

binbot added a comment. Sep 4 2016, 2:40 PM

I will set up core, but that takes more time. For the moment I visited https://www.mediawiki.org/wiki/Manual:Pywikibot/Scripts which states: "Compat equivalent: pagegenerators.py and get.py." I also looked into the code, and it does not have the complexity of fixes. I don't need a simple pagegenerator for this task; I need replace.py with -save that uses the same fixes and exceptions (including T144693) that the real replacements use. My fixes are rather sophisticated, and I don't see any sign that listpages.py could provide this level of functionality.

binbot added a comment (edited). Sep 4 2016, 2:56 PM

I also don't know how much experience you have with frustrating console encoding problems, but I have suffered from them for years. While saved fixes always worked well, the same was not true of command line arguments; I had a great many problems with those. Now, -save opens a reliable UTF-8 file using codecs. That is proven. I don't have the energy to try new methods and experiment again when something already works fine.

Or maybe listpages functionality should be extended instead.

Ato_01 added a subscriber: Ato_01. Aug 3 2017, 5:52 AM
Xqt renamed this task from Port saving options from compat to Port replace.py saving options from compat. Aug 24 2018, 11:24 AM
4nn1l2 removed a subscriber: 4nn1l2. Aug 24 2018, 11:53 AM