
Add "level=n" option to the template for specifying the heading level of the sections to be archived
Open, Normal, Public

Description

AFAIK, archivebot currently only checks threads under level 2 headings.

But some pages, such as WP:RFP, need only their level 3 headings archived.

Please provide an option to restrict archiving to a certain heading level.


Since this looks like a rather complex request, here is an attempt to recap.

Proposal

  • The bot should recognize a new template argument in the form of "level=3" which tells the bot to archive threads starting with "=== Title ===", and keep headings in higher levels intact.
  • The bot should replicate the higher thread structure from the original page into the archive page. The bot should parse the archive page to determine the locations where threads should be archived. See Dalba's comment for an example.
    • In an archive page, threads of the target level should probably be ordered within a parent thread by last timestamps (consistent with the current state).

Current state

  • The bot archives level-2 threads (starting with "== Title ==") only. If a level-2 thread has level-3 threads such as "=== Sub topic 1 ===" and "=== Sub topic 2 ===" in it, those will stay together if one of the two is old enough.
  • The bot doesn't (have to) parse archive pages. Threads newly archived are always appended at the bottom. This effectively reorders threads by the last message's timestamp.
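To make the difference between the current and the proposed behaviour concrete, here is a minimal sketch in plain Python of splitting a page into threads at a configurable heading level. It is not archivebot's actual code: the names `HEADING_RE` and `split_threads` are made up for illustration, and the timestamp-based age check is left out entirely.

```python
import re

# A heading line such as "=== Title ===": the same run of "=" on both sides.
HEADING_RE = re.compile(r'^(={1,6})\s*(.+?)\s*\1\s*$')


def split_threads(text, level=2):
    """Split wikitext into the text that stays and the threads at `level`.

    Returns (kept_text, threads), where each thread is a tuple
    (parent_title, title, thread_text). parent_title is the title of the
    closest preceding heading above `level`, or None if there is none.
    """
    kept = []        # intro text and higher-level headings stay on the page
    threads = []     # collected as [parent, title, list_of_lines]
    parent = None
    current = None   # line list of the thread being collected, if any
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        found = len(match.group(1)) if match else 0
        if 0 < found < level:
            # A higher-level heading closes the open thread and is kept.
            current = None
            parent = match.group(2)
            kept.append(line)
        elif found == level:
            # A heading of the target level starts a new thread.
            current = [line]
            threads.append([parent, match.group(2), current])
        elif current is not None:
            # Body text and deeper sub-headings belong to the open thread.
            current.append(line)
        else:
            kept.append(line)
    joined = [(p, t, '\n'.join(body)) for p, t, body in threads]
    return '\n'.join(kept), joined


page = "== Topic ==\n\n=== Sub topic 1 ===\nfoo\n\n=== Sub topic 2 ===\nbar\n"
for parent, title, body in split_threads(page, level=3)[1]:
    print(parent, '/', title)
```

With `level=2` the two sub topics end up inside a single thread, which corresponds to the current state described above. With `level=3` each sub topic becomes its own thread and the "== Topic ==" heading stays on the page.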

Event Timeline

Dalba created this task. Nov 28 2015, 12:45 PM
Dalba updated the task description.
Dalba raised the priority of this task from to Needs Triage.
Dalba added a subscriber: Dalba.
Dalba updated the task description. Nov 28 2015, 12:48 PM
Dalba set Security to None.
jayvdb added a subscriber: jayvdb.

@Dalba, would you be able to co-mentor this enhancement with me if we added this as a Google-Code-In-2015 task? If someone submits a solution for the task, we would need to test their solution within 24 hrs, so it would be safer if there are two mentors who can test the code.

As long as it's just testing, sure, I can test! :)

@Dalba, great. Can you follow https://www.mediawiki.org/wiki/Google_Code-in_2015#Become_a_Wikimedia_GCI_mentor and maybe ping Andre to send you a mentor invite?

Dalba added a comment. Dec 11 2015, 5:46 AM

@jayvdb I just noticed that Google does not allow participation from my country (Iran) and shows this message: "We're sorry, but this service is not available in your country. That’s all we know." Although I may be able to circumvent it using a proxy server, I prefer not to. Anyway, as far as I know, students will submit their patches here on Phabricator and Gerrit, so hopefully I'll still be able to help with testing them if needed.

OK. I'll set up the GCI task and ping you directly when I get the notification from the codein servers.

murfel claimed this task. Dec 14 2015, 12:04 PM

Change 258988 had a related patch set uploaded (by Murfel):
Allow archivebot to archive certain heading levels

https://gerrit.wikimedia.org/r/258988

I used the following on Test Wiki to test it:

python pwb.py scripts/archivebot.py -ns:2 -page:Murasha/sandbox -level:3 User:Murasha/sandbox

It will affect the following pages (prepared them for you to test the script too):
test:User_talk:Murasha/sandbox
test:User_talk:Murasha/sandbox/arch

(All these 'Level 3', 'Yet another Level 3' sections won't be archived for some reason. I added them when I was trying to put together an archivable page, then I ran the script on them and received some errors. I think the errors were caused by missing timestamps.)

Yes, archivebot expects timestamps in each archivable section.

Dalba added a comment. Edited Dec 15 2015, 12:45 AM

I hope this is not too much to ask, but ideally, when the bot archives a level 3 header, it should archive it under a level 2 header with the same title as its parent on the original page.

For example, consider the following case:

The main page content is like:

== h21 ==

=== h31 ===
archive me. ~~~~


== h22 ==

=== h32 ===
archive me. ~~~~

The archive content is like:

== h21 ==

=== h30 ===
previously archived. ~~~~


== h22 ==

=== h30 ===
previously archived. ~~~~

The new archive should look like:

== h21 ==

=== h30 ===
previously archived. ~~~~

=== h31 ===
archive me. ~~~~


== h22 ==

=== h30 ===
previously archived. ~~~~

=== h32 ===
archive me. ~~~~

Of course, doing this requires parsing the archive page.
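As a rough illustration of the merge in the example above, here is a minimal sketch in plain Python. It is not archivebot code: `insert_under_parent` is a made-up name, and appending the thread at the end of the matching parent section (or creating the parent at the bottom of the archive when it does not exist yet) is an assumption about the desired behaviour.

```python
import re

# A heading line such as "== Title ==": the same run of "=" on both sides.
HEADING_RE = re.compile(r'^(={1,6})\s*(.+?)\s*\1\s*$')


def insert_under_parent(archive_text, parent_title, thread_text, level=3):
    """Return archive_text with thread_text added under its parent heading.

    The parent is the heading one level above `level` whose title equals
    parent_title. If the archive has no such heading, one is appended.
    """
    lines = archive_text.splitlines()
    parent_level = level - 1
    thread_lines = thread_text.strip('\n').splitlines()
    start = None          # index of the matching parent heading
    end = len(lines)      # index just past the parent's section
    for i, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if not match:
            continue
        found = len(match.group(1))
        if start is None:
            if found == parent_level and match.group(2) == parent_title:
                start = i
        elif found <= parent_level:
            # The next heading at or above the parent level ends the section.
            end = i
            break
    if start is None:
        # No matching parent yet: create it at the bottom of the archive.
        marks = '=' * parent_level
        heading = '{} {} {}'.format(marks, parent_title, marks)
        prefix = lines + [''] if lines else []
        return '\n'.join(prefix + [heading, ''] + thread_lines) + '\n'
    before, section, after = lines[:start], lines[start:end], lines[end:]
    # Drop trailing blank lines so the spacing around the new thread is even.
    while section and not section[-1].strip():
        section.pop()
    merged = before + section + [''] + thread_lines
    if after:
        merged += [''] + after
    return '\n'.join(merged) + '\n'


archive = "== h21 ==\n\n=== h30 ===\npreviously archived.\n"
print(insert_under_parent(archive, 'h21', "=== h31 ===\narchive me.\n"))
```

Appending at the end of the parent section mirrors what the bot does today at page level, where newly archived threads always go to the bottom.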

Mpaa added a subscriber: Mpaa. Dec 15 2015, 8:48 PM

@Dalba, that sounds reasonable, I agree with you.

whym added a subscriber: whym. Edited Feb 11 2016, 7:20 AM

I'm not sure https://gerrit.wikimedia.org/r/#/c/258988/ does what this task asks for. I'm not saying the patch is not useful, but the patch seems to solve a different issue.

It sounds like this task asks for a per-page ability to specify the level, not a per-run configuration on the command line. English Wikipedia's WP:RFP needs level-3 archiving, while many other pages there need level-2 archiving.

I'll take that into account if I manage to get to this. (Anyone else should feel free to take over this task.)

Mpaa added a comment. Feb 11 2016, 6:38 PM

Sounds like level should be part of the configuration template then?

whym renamed this task from "Add the ability to archive only threads with certain heading levels" to "Add "level=n" option to the template for specifying the heading level of the sections to be archived". Feb 13 2016, 12:20 PM

Sounds like level should be part of the configuration template then?

I think so and I have edited the title accordingly.

whym updated the task description. Feb 21 2016, 7:49 AM
whym added a subscriber: murfel.

I was about to create a task requesting support for defining which headings the bot should archive on a given page. This would be very helpful on some project pages. I'd be very glad if this could be done. Best regards.

MarcoAurelio triaged this task as Normal priority. Edited Jul 21 2016, 10:50 AM

To clarify, it should be possible to add a parameter like |level = n to the archive template that we put on pages, n being the level of the headers we'd like to archive. If, in addition, we want to add an option on the command line, that'd be good too.
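For what it's worth, reading such a parameter could look roughly like the sketch below. It uses a plain regular expression on the template call rather than any real Pywikibot API, and `parse_level` as well as the template name in the usage line are hypothetical. Falling back to level 2 when the parameter is absent is an assumption, chosen to match the current behaviour.

```python
import re

# "|level = n" followed by the next parameter or the end of the template.
LEVEL_RE = re.compile(r'\|\s*level\s*=\s*(\d+)\s*(?=[|}])')


def parse_level(template_text, default=2):
    """Return the heading level requested by "|level = n", or `default`.

    Raises ValueError if n is not a valid wikitext heading level (1 to 6).
    """
    match = LEVEL_RE.search(template_text)
    if match is None:
        # Parameter missing or not a plain number: keep the default.
        return default
    level = int(match.group(1))
    if not 1 <= level <= 6:
        raise ValueError('level must be between 1 and 6, got %d' % level)
    return level


# Hypothetical template call; the template name is only a placeholder.
print(parse_level("{{Hypothetical archive config\n|level = 3\n}}"))
```

A command-line option could then simply override the value read from the template.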

@murfel Are you still working on this?

Change 258988 abandoned by Murfel:
Allow archivebot to archive certain heading levels

https://gerrit.wikimedia.org/r/258988

MarcoAurelio added subscribers: Xqt, valhallasw. Edited Nov 22 2017, 9:57 AM

@Xqt @valhallasw Is this task eligible for Google-Code-in-2017? If so, would you like to mentor it?

Ping also @jayvdb and @Legoktm as they're registered in the MediaWiki GCI page.

Being bold and proposing this for GCI.

Dvorapa removed a subscriber: Dvorapa.
whym removed murfel as the assignee of this task. Nov 23 2017, 1:27 AM

I'll take that into account if I manage to get to this. (Anyone else should feel free to take over this task.)

Since it sounds like you are not actively working on it, I'll remove you as the assignee.

whym added a comment. Nov 23 2017, 1:28 AM

Re: GCI: it appears that GCI expects a participant to solve many small, beginner-friendly tasks during the period, not one large project:

The organizations create a large list of short (3-5 hour) tasks for students to work on. (https://developers.google.com/open-source/gci/how-it-works)

If that is the case, this task does not seem like a great choice. It's fairly complex. If you want to see it moving forward, I'd suggest creating subtasks that are easier to solve.

(That said, I might be underestimating it - of course if someone can solve the complex task within GCI, that would be great.)

Dvorapa added a subscriber: Dvorapa.

@Dvorapa: Do you plan to mentor this task? If so feel free to create it on the GCI site. :)

Yes, I'll find some spare time and make GCI tasks for this.

@Dvorapa: You have two more weeks for Google-Code-in-2018, if you still plan to. :)