Add page protection filter to CirrusSearch
Open, Medium, Public

Description

Search is often used for finding articles to edit; the ability to exclude protected articles would make that more effective. GrowthExperiments, which offers articles with simple editing tasks to newcomers (and thus needs to avoid recommending protected articles), currently filters out protected articles on the client side (via action=query&prop=info&inprop=protection) which is far from ideal and makes proper handling of result sizes and offsets impossible.
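To illustrate why client-side filtering interferes with result sizes, here is a minimal Python sketch. It assumes the `action=query&prop=info&inprop=protection` response has already been fetched with `formatversion=2` (so `pages` is a list) and parsed into a dict. The function name and the sample data are hypothetical and hand-written; this is not GrowthExperiments code.

```python
def unprotected_titles(api_response, action="edit"):
    """Return titles of pages that carry no protection entry for `action`."""
    pages = api_response.get("query", {}).get("pages", [])
    kept = []
    for page in pages:
        protections = page.get("protection", [])
        # Drop the page if any protection entry restricts the given action.
        if not any(p.get("type") == action for p in protections):
            kept.append(page["title"])
    return kept


# Hypothetical response for three search hits (illustrative data only).
sample_response = {
    "query": {
        "pages": [
            {"title": "Alpha", "protection": []},
            {
                "title": "Beta",
                "protection": [
                    {"type": "edit", "level": "sysop", "expiry": "infinity"}
                ],
            },
            {
                "title": "Gamma",
                "protection": [
                    {"type": "move", "level": "sysop", "expiry": "infinity"}
                ],
            },
        ]
    }
}

remaining = unprotected_titles(sample_response)
```

Three results were requested from search but only two survive the filter, so the caller cannot know in advance how many hits to request or which offset the next page should start from.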

It would be nice to have a CirrusSearch keyword (maybe hasprotection:edit / hasprotection:move?) for filtering for protection status. Page protection changes are accompanied by a null edit, which pushes status changes to the search index, so AIUI all that would be needed is to add a protection field to the ElasticSearch index, add it to the EventBus event for new revisions, and register the appropriate search feature.

Event Timeline

Restricted Application added a subscriber: Aklapper.
CBogen triaged this task as Medium priority. Aug 27 2020, 9:11 PM

Some discussion from IRC today with @dcausse:

10:35 <kostajh> hi dcausse, tgr and I were wondering about the magnitude of effort it would need to do T259346
10:35 <stashbot> T259346: Add page protection filter to CirrusSearch - https://phabricator.wikimedia.org/T259346
10:38 <+dcausse> kostajh: hi, I'll take a look and discuss with the rest of the team, at first glance it does not look terribly hard
10:45 <+dcausse> if page protection happens during the normal editing process I wonder if it'd not be simpler to just add this info to the standard features provided by cirrus
10:50 <kostajh> dcausse: thanks for looking!
10:50 <kostajh> yes, afaik, it happens in the process of editing, but I'm not super familiar with the mechanism
10:52 <+dcausse> me neither but if the data is available when cirrus needs to build its doc it'll be fairly trivial
11:01 <kostajh> dcausse: right, so the action is initiated via Article::protect(), and during that process a null revision is created and at that point we know that the article is now protected (or was protected and no longer is)
11:16 <+dcausse> kostajh: great, looks like the info is stored in mysql so should be fast to access but what's annoying is that it has a notion of expiration date, perhaps no big deal if we store the expiration date
11:17 <kostajh> after the expiration date, I believe a job will update the page with another null revision to reflect the updated protection status
11:18 <+dcausse> oh that simplifies things a lot
11:19 <kostajh> that's my assumption anyway...
11:25 <kostajh> bah, a faulty assumption
11:27 <kostajh> I guess it would be possible to enqueue a job to run shortly after the known expiration date, and the job would have to look to see if this protection is still the latest one that is valid for the page (e.g. no other protections were added/removed in the meantime), and that could trigger a null edit
11:28 <+dcausse> we could, I have no clue how that would be, but we could also look into indexing the expiration date and use that for filtering
11:29 <+dcausse> s/no clue how that/no clue how hard that

Thanks for looking into it!

I forgot about the expiration part, that does make it more complicated. I think the easier approach for that would be to store the expiry timestamp in the Cirrus index, and the search would be for articles with an expiry date in the future. The job-based alternative would be more fragile: jobs can fail or get lost, leaving the search index incorrect.
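A rough Python sketch of the expiry-timestamp idea, under stated assumptions: the field name `protection_edit_expiry` is hypothetical (no such field exists in the Cirrus index today), and indefinite protection is assumed to be stored as a far-future date. The first function builds an Elasticsearch-style `bool` filter that excludes documents whose expiry is still in the future; the second mirrors the same comparison in plain Python.

```python
from datetime import datetime, timezone

# Hypothetical index field holding the edit-protection expiry timestamp.
EXPIRY_FIELD = "protection_edit_expiry"


def exclude_protected_filter(field=EXPIRY_FIELD):
    """Build a query fragment that drops pages whose protection has not expired."""
    # "now" is Elasticsearch date math, evaluated at query time.
    return {"bool": {"must_not": [{"range": {field: {"gt": "now"}}}]}}


def is_protected(expiry, now):
    """Same check in plain Python; `expiry` is None for unprotected pages."""
    return expiry is not None and expiry > now


now = datetime(2020, 9, 1, tzinfo=timezone.utc)
lapsed = datetime(2020, 8, 1, tzinfo=timezone.utc)
active = datetime(2020, 10, 1, tzinfo=timezone.utc)
```

Because the comparison happens at query time, a page whose protection has lapsed stops being filtered out without any index update or follow-up job.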

Tgr moved this task from Inbox to External on the Growth-Team board.

@MMiller_WMF this is the task mentioned in the planning meeting - if we have to, we could build a protection filter on the PHP side, but integrating it into search would be more performant and more elegant with pagination and such, and the functionality seems useful for many potential use cases.

@Gehel any thoughts on where this task might fit (or not) into your team's scheduled work?

@kostajh, this task isn't currently scheduled to be worked on in the near future due to a lot of other high priority work, though it has been triaged for us to handle later (i.e. there is no estimate yet of when it might be addressed).

From the ticket's description, it seems like the task recommendations are currently functional without a page protection filter, and that this is more of a nice-to-have for the project? Let me know if I'm mistaken about how crucial the requested page protection functionality is, and we can look into potentially reprioritizing this task for work sooner.

Thanks @MPhamWMF, yes, it's a nice-to-have.

Currently, to show newcomer tasks on Special:Homepage, we perform a search, then filter out protected pages after getting the results. We now also generate some statistics on Special:NewcomerTasksInfo; before implementing protected page filtering there, I wanted to see what the status of this task was.