Restartable bot framework
Open, Low, Public

Assigned To

None

Authored By

	jayvdb
	Jul 9 2016, 10:14 AM

Description

The pywikibot codebase is almost ready for a bot pause/resume project.

Many of the scripts now have a Bot class which inherits from pywikibot.bot.Bot, with a run method which is either inherited or behaves in a consistent manner. Thus it is easy to place the pause/resume functionality into the Bot class, and to implement and test it for multiple scripts.

Mentors: @DrTrigon

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T139842 Restartable bot framework
		Invalid		None	T199643 Generate a new CSRF token if the old one is invalidated

Event Timeline

jayvdb created this task. Jul 9 2016, 10:14 AM

Restricted Application added subscribers: pywikibot-bugs-list, Zppix, Aklapper. Jul 9 2016, 10:14 AM

DrTrigon added a project: User-DrTrigon. Jul 9 2016, 7:19 PM

DrTrigon moved this task from Backlog to GSoC on the User-DrTrigon board.

As I see it, we need at least the throttle and HTTP functions to be pausable; that should already cover most of what we need, right?

I think the simplest option is to stop after a given page has been processed, and to continue from there the next time.

I agree with @valhallasw that this should be page based resumption.

The simplest solution (MVP) would, on pause, continue to run the existing generator, and write out a page title list to a file instead of processing the pages.
Then the resumption would use the page title list in the stored file using -file:xxx.
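A minimal sketch of this MVP, assuming page titles are plain strings and the dump file holds one title per line. The names `run_with_pause`, `treat`, `pause_requested` and `dump_path` are hypothetical and not part of pywikibot:

```python
from typing import Callable, Iterable, List


def run_with_pause(titles: Iterable[str],
                   treat: Callable[[str], None],
                   pause_requested: Callable[[], bool],
                   dump_path: str) -> int:
    """Process titles until a pause is requested, then dump the rest.

    Returns the number of unprocessed titles written to dump_path.
    """
    remaining: List[str] = []
    for title in titles:
        if remaining or pause_requested():
            # Paused: keep draining the generator, but only record titles.
            remaining.append(title)
        else:
            treat(title)
    if remaining:
        with open(dump_path, 'w', encoding='utf-8') as f:
            for title in remaining:
                f.write(title + '\n')
    return len(remaining)
```

The resulting file could then be passed back to the script with -file:xxx, assuming that option accepts one title per line.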

A better implementation would, on pause, store the next page title to be processed, and all of the generator arguments.
Then the resumption would create a new generator which resumes where the last generator stopped.
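As a rough sketch of this second approach, the state could be kept as a small JSON file. The helper names (`save_state`, `load_state`, `titles_from`) and the state layout are assumptions for illustration; `titles_from` is only a stand-in for a real generator that supports a 'start from' title:

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def save_state(path: str, generator_args: Dict[str, Any],
               next_title: str) -> None:
    """Store the generator arguments and the next title to process."""
    state = {'generator_args': generator_args, 'next_title': next_title}
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(state, f)


def load_state(path: str) -> Dict[str, Any]:
    """Read back the state written by save_state."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def titles_from(all_titles: Iterable[str],
                start: Optional[str] = None) -> Iterator[str]:
    """Stand-in generator: yield titles in sorted order from start on."""
    for title in sorted(all_titles):
        if start is None or title >= start:
            yield title
```

On resume, the stored arguments would be used to rebuild the generator, with `next_title` passed as its starting point.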

We need both approaches: e.g. the former is better if the input generator was -file:xxx, and the second approach will not be possible for some API generators that do not support a 'start from' title argument.

And there are some generators that are not pause/resume-able, like -random, in which case the user should simply re-run their original command to continue.

Resumption at the HTTP/network level depends on the MediaWiki API maintaining/respecting old continuation data. This is not unreasonable in many cases, as the API continuation data is often page titles, etc.
However, while we may be able to pause/resume the HTTP API process, the user may have hit pause at the first record of 5000 records in the last HTTP API result set, so that implementation will still need to handle the case of injecting cached data into the API layer before switching back to fetching from the HTTP layer.
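To illustrate the cached-data case, here is a sketch that replays the unprocessed records of the last result set before fetching again. `resumed_records` and `fetch_batch` are hypothetical; `fetch_batch(token)` is assumed to return a `(records, next_token)` pair, with `None` as the token once the results are exhausted:

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Batch = Tuple[List[str], Optional[str]]


def resumed_records(cached: Iterable[str],
                    fetch_batch: Callable[[str], Batch],
                    continuation: Optional[str]) -> Iterator[str]:
    """Yield cached records first, then continue fetching batches."""
    # Records that were already fetched but not yet processed at pause time.
    yield from cached
    token = continuation
    while token is not None:
        records, token = fetch_batch(token)
        yield from records
```

This only works if the stored continuation value is still accepted when the run resumes.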

The simplest solution (MVP) would, on pause, continue to run the
existing generator, and write out a page title list to a file instead
of processing the pages.
Then the resumption would use the page title list in the stored file
using -file:xxx.

A better implementation would, on pause, store the next page title to
be processed, and all of the generator arguments.
Then the resumption would create a new generator which resumes where
the last generator stopped.

I agree and have to note that something similar is in catimages too... ;) From there I know these simple approaches can go terribly wrong if the page to resume with has been deleted in the meantime... Other than that it is doable and will cover the majority of the cases.
Dr. Trigon

In T139842#2446119, @jayvdb wrote:

I agree with @valhallasw that this should be page based resumption.

The simplest solution (MVP) would, on pause, continue to run the existing generator, and write out a page title list to a file instead of processing the pages.
Then the resumption would use the page title list in the stored file using -file:xxx.

Both approaches are needed, as the one above does not apply when you want to pause because you need to log off.

Why does it not apply in that case?

If I need to switch off my computer, how can I let it "write out a page title list to a file instead of processing the pages"?

Maybe @Mpaa meant forced shutdown, or signal 9, in which case letting the generator run in collect-only mode isn't possible.
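Signal 9 (SIGKILL) cannot be caught, so one way to survive it is to checkpoint after every page instead of at pause time. A sketch under that assumption, with hypothetical names (`write_checkpoint`, `read_checkpoint`, `process_all`) and a hypothetical JSON layout:

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Optional


def write_checkpoint(path: str, last_done: str) -> None:
    """Atomically record the last fully processed title."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump({'last_done': last_done}, f)
        f.flush()
        os.fsync(f.fileno())
    # Replace in one step so a kill never leaves a half-written file.
    os.replace(tmp, path)


def read_checkpoint(path: str) -> Optional[str]:
    """Return the last processed title, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, encoding='utf-8') as f:
        return json.load(f)['last_done']


def process_all(titles: Iterable[str], treat: Callable[[str], None],
                checkpoint_path: str) -> None:
    """Process every title, checkpointing after each one."""
    for title in titles:
        treat(title)
        write_checkpoint(checkpoint_path, title)
```

The cost is one small file write per page, which is probably negligible next to the network round trips.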

catimages just stored the name of the last file processed and then continued from there on restart.

Instead of outputting the skipped ones to a text file, output the processed ones. In case the last one got deleted and the generator returns another list, we have plenty of fallback possibilities... 2nd last, the one before that, etc.
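The fallback idea can be sketched as a lookup that walks backwards through the log of processed titles until it finds one that still exists in the new generator output. `resume_index` is a hypothetical helper, and titles are assumed to be plain strings:

```python
from typing import List


def resume_index(current_titles: List[str], processed: List[str]) -> int:
    """Return the index in current_titles at which to resume.

    Tries the last processed title first, then the 2nd last, and so on.
    Falls back to 0 (start over) if none of them is found.
    """
    positions = {title: i for i, title in enumerate(current_titles)}
    for title in reversed(processed):
        if title in positions:
            return positions[title] + 1
    return 0
```

For example, if 'C' was the last processed page but has since been deleted, the lookup falls back to 'B' and resumes with the page after it.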

@jayvdb and @DrTrigon, would you like to feature this project for GSoC/Outreachy May-Aug 2017? If yes, please add tag Outreach-Programs-Projects

zhuyifei1999 subscribed. Feb 8 2017, 6:56 PM

@jayvdb: if you are interested, then I am willing to help of course... ;)

What about @AbdealiJK?

jayvdb updated the task description. Mar 9 2017, 3:15 AM

I haven't been actively developing Pywikibot, and preparing to mentor this particular project (which would fiddle deep in the belly of pywikibot) would require getting up to speed on a lot of changes since I was very active.

Removing the Possible-Tech-Projects tag as we are planning to kill it soon! This project does not seem to fit in the Outreach-Programs-Projects category in its current state, so I am not adding that tag right now!

zhuyifei1999 merged a task: T199640: Provide a way to store a breakpoint when running some scripts. Jul 15 2018, 5:12 AM

zhuyifei1999 added a subscriber: Bugreporter.

Bugreporter mentioned this in T199643: Generate a new CSRF token if the old one is invalidated. Jul 15 2018, 10:01 AM

Framawiki subscribed. Jul 18 2018, 6:14 PM

Xqt triaged this task as Low priority. Sep 29 2018, 5:36 AM

Xqt added a subtask: T199643: Generate a new CSRF token if the old one is invalidated. Jul 30 2019, 5:56 PM

Xqt closed subtask T199643: Generate a new CSRF token if the old one is invalidated as Invalid. Oct 18 2022, 4:33 AM