Allow searching articles by WikiProject
Closed, Resolved · Public · Feature

Description

Because WikiProject tags are placed on the talk page (rather than on the page itself), it is impossible to filter search results to pages in a WikiProject with the hastemplate or incategory search operators.

The PageAssessments extension nevertheless knows about the WikiProject associations. Using that data, an inproject operator should be provided to facilitate this use case.

Event Timeline

Restricted Application added a subscriber: Aklapper.
SD0001 updated the task description.

Change #1086539 had a related patch set uploaded (by SD0001; author: SD0001):

[mediawiki/extensions/PageAssessments@master] Add CirrusSearch integration to enable searching by WikiProject

https://gerrit.wikimedia.org/r/1086539

Change #1086540 had a related patch set uploaded (by SD0001; author: SD0001):

[integration/config@master] zuul: PageAssessments: Add CirrusSearch as a phan dependency

https://gerrit.wikimedia.org/r/1086540

Change #1086540 merged by jenkins-bot:

[integration/config@master] zuul: PageAssessments: Add CirrusSearch as a phan dependency

https://gerrit.wikimedia.org/r/1086540

In addition to onSearchDataForIndex2, CirrusSearch now has another way to ingest simple tags into the search index, one that might not require updating the search index mapping or introducing a new field. This is called WeightedTags, and my understanding is that this use case might be a good fit for it.
Instead of supplying the data through onSearchDataForIndex2, which is called every time CirrusSearch builds a document, your data can be pushed when you detect a change. From reading the PageAssessments code base, my understanding is that you detect this via the LinksUpdate hook.
Perhaps PageAssessmentsDAO can be adapted to call the CirrusSearch WeightedTagsUpdater service if available.

What's unclear to me here is how many pages might be impacted at this point and whether or not calling WeightedTagsUpdater will be prohibitively slow for your LinksUpdate hook to complete in a reasonable amount of time.

In your patch you index the project name but I wonder if indexing the project_id might not be better? (esp. if project can be renamed?). At search time it is OK for the keyword to do a lookup by implementing \CirrusSearch\Query\KeywordFeature::expand( KeywordFeatureNode $node, SearchConfig $config, WarningCollector $warningCollector ), possibly providing a warning if the project does not exist. Please see InCategoryFeature for an example of how to implement expand.

With weighted tags you might also get the opportunity to index a weight that could possibly be used to rank pages when using this new inproject keyword (perhaps the importance can be converted into a weight?).
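
To illustrate the importance-to-weight idea, here is a minimal Python sketch. It is not taken from the patch: the rating names, the numeric weights, the default value, and the function name `importance_to_weight` are all assumptions chosen for illustration.

```python
# Hypothetical mapping from an importance rating to an integer weight.
# The values below are illustrative assumptions, not what the extension uses.
IMPORTANCE_WEIGHTS = {
    "Top": 1000,
    "High": 750,
    "Mid": 500,
    "Low": 250,
}
DEFAULT_WEIGHT = 100  # fallback for unknown, "NA", or missing ratings


def importance_to_weight(importance):
    """Return an integer weight for an importance rating (case-insensitive)."""
    if not importance:
        return DEFAULT_WEIGHT
    # Normalize " high " or "HIGH" to "High" before the lookup.
    return IMPORTANCE_WEIGHTS.get(importance.strip().capitalize(), DEFAULT_WEIGHT)
```

A monotonic table like this would let higher-importance pages rank first; whether the steps should be evenly spaced is a tuning question left open here.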

For the query, the keyword might be very similar to what you already wrote. If you do not set any weight and do not want the keyword to influence the rank, a very similar one you could get some inspiration from is HasRecommendationFeature. If, on the other hand, you believe that ranking based on the importance might be interesting, you can take inspiration from ArticleTopicFeature, which is no longer a FilterQueryFeature and implements the scoring query in doApply.

The problem with adding new data this way is that it gets populated only when the talk page gets a LinksUpdate, so it might be handy to have a maintenance script that populates this tag initially (using the same WeightedTagsUpdater service).

@SD0001 please let me know if you have questions and/or concerns about this approach. Thanks!

Thanks @dcausse for the detailed explanation.

Using weighted_tags does indeed appear to be a superior solution, as it allows (near-)real-time updates when the projects on the talk page are updated. The SearchDataForIndex2 contract is too inflexible: it doesn't allow updating the ES doc of the subject page when the talk page is edited, leaving things inconsistent until the next time the subject page is edited.

I notice that although the schema in ElasticSearch is not Wikimedia-specific, weighted tags are gated behind a Wikimedia-specific feature flag in the CirrusSearch codebase. Is there a plan to change this? Unlike articletopic and linkrecommendation, the PageAssessments extension is usable in third-party wikis.

What's unclear to me here is how many pages might be impacted at this point and whether or not calling WeightedTagsUpdater will be prohibitively slow for your LinksUpdate hook to complete in a reasonable amount of time.

As the LinksUpdate process is not user-facing and takes place in a job, I don't think performance would be a problem.

In your patch you index the project name but I wonder if indexing the project_id might not be better? (esp. if project can be renamed?).

Renaming the project causes it to get a new id, so conceptually I think there's no advantage. I'm not very familiar with ElasticSearch though – is there a performance benefit to indexing by a numeric field? If so, it would indeed make sense to index by the id.

With weighted tags you might also get the opportunity to index a weight that could possibly be used to rank pages when using this new inproject keyword (perhaps the importance can be converted into a weight?).

Sounds like a great idea!

I notice that although the schema in ElasticSearch is not Wikimedia-specific, weighted tags are gated behind a Wikimedia-specific feature flag in the CirrusSearch codebase. Is there a plan to change this? Unlike articletopic and linkrecommendation, the PageAssessments extension is usable in third-party wikis.

Indeed, I see that they are behind CirrusSearchWMFExtraFeatures. This is annoying because, technically speaking, weighted_tags are not WMF-specific. I think the flag is there because the first feature we implemented with them was very much WMF-specific (ingesting ORES article topics). I'll try to clarify that and possibly change the config a bit so that it's less confusing.
The only requirement, and perhaps something we should document in CirrusSearch, is that they need the WMF extra plugin to be installed. My understanding is that most third parties using Cirrus install this plugin and the highlighter (without them you don't get nice features such as regex search with insource://).

In your patch you index the project name but I wonder if indexing the project_id might not be better? (esp. if project can be renamed?).

Renaming the project causes it to get a new id, so conceptually I think there's no advantage. I'm not very familiar with ElasticSearch though – is there a performance benefit to indexing by a numeric field? If so, it would indeed make sense to index by the id.

There are no performance considerations here: technically speaking, elastic will see a string tag_prefix/tag anyway. Indexing the id would only have made sense if renaming were supported. The only thing I would be careful about is projects with special characters in their names, which would make searching for them a bit annoying. For example, if a name contains quotes, as in Project for "something", the search syntax forces the user to escape them. If this happens and becomes really annoying, we could think about adding a config that maps these projects to simpler strings, in the same vein as ArticleTopicFeature, to improve usability.
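
As a rough illustration of why a numeric id brings no indexing benefit, the Python sketch below models the tag as the single string described above. The prefix `example.project` and the helper names `format_tag` and `parse_tag` are hypothetical; they are not the identifiers used by CirrusSearch or PageAssessments.

```python
def format_tag(prefix, value):
    """Build the 'tag_prefix/tag' string; value may be a name or a numeric id."""
    if "/" in prefix:
        raise ValueError("prefix must not contain '/'")
    # Whatever the type of value, the result is a plain string.
    return f"{prefix}/{value}"


def parse_tag(tag):
    """Split on the first '/' only, so names containing '/' stay intact."""
    prefix, sep, value = tag.partition("/")
    if not sep:
        raise ValueError("tag has no prefix separator")
    return prefix, value
```

Under these assumptions, `format_tag("example.project", 42)` and `format_tag("example.project", "Biography")` both produce ordinary strings, so the choice between id and name is about usability rather than speed.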

Change #1086539 merged by jenkins-bot:

[mediawiki/extensions/PageAssessments@master] Add CirrusSearch integration to enable searching by WikiProject

https://gerrit.wikimedia.org/r/1086539

This change rolled out to enwiki today. WeightedTagsUpdate traffic from this feature peaked at 177k/hr two hours ago. The rate of requests entering the pipeline is roughly the same as that of tags being written, so it looks like all is good. A search for inproject:Biography returns 238k pages as of writing.

The tag traffic should significantly slow down when r1088592 goes out.

@SD0001 awesome!

I guess we can monitor another metric via the elastic write results graph (filtering on TAGS_UPDATE | UPDATED vs TAGS_UPDATE | NOOP). Once NOOP is high compared to UPDATED, this should be a good indication that your optimisation at r1088592 can be shipped safely.
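
The NOOP-versus-UPDATED check could be expressed as a simple ratio. The following Python sketch assumes the two counts have already been read off the graph for the same time window; the function names and the 0.9 threshold are illustrative assumptions, not values from the dashboard or the patch.

```python
def noop_share(updated, noop):
    """Fraction of tag writes that were no-ops in the sampled window."""
    total = updated + noop
    if total == 0:
        return 0.0  # no traffic in the window, nothing to conclude
    return noop / total


def initial_population_looks_done(updated, noop, threshold=0.9):
    """True when most writes change nothing, i.e. tags are mostly in place."""
    return noop_share(updated, noop) >= threshold
```

For example, a window with 5 UPDATED and 95 NOOP results gives a share of 0.95, which passes the assumed threshold, while a 50/50 split does not.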

A search for inproject:Biography now returns 2.37 million pages, which is fewer than the transclusions of Template:WikiProject banner (2.5 million), but seems close enough.

SD0001 claimed this task.
SD0001 added a project: User-notice.

Suggested blurb for tech news: On wikis with PageAssessments installed, you can now filter search results to pages in a given WikiProject by using the inproject: keyword.

FYR, this extension is available on only a few non-beta wikis: https://extloc.toolforge.org/extensions/PageAssessments.

Playing with this, I noticed I have to do inproject:"United States" for that project to handle the space, and if I want to do a subproject of it, I have to do something like this: inproject:"United States/WikiProject Louisville"

I discovered this a while back, when CleanupWorklistBot was unable to pull rating/importance data for WP US subprojects in its output until the above format was used.

This may be useful to cover in documentation.
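
For documentation purposes, the quoting rule observed above can be sketched as a small Python helper that builds the search term. The function name `inproject_query` is hypothetical, and the backslash escaping of embedded double quotes is an assumption about the search syntax rather than something confirmed in this task.

```python
def inproject_query(project):
    """Build an inproject: search term, quoting the name when it needs it."""
    needs_quotes = any(ch.isspace() for ch in project) or '"' in project
    if not needs_quotes:
        return f"inproject:{project}"
    # Assumption: backslash escapes a double quote inside a quoted value.
    escaped = project.replace("\\", "\\\\").replace('"', '\\"')
    return f'inproject:"{escaped}"'


print(inproject_query("Biography"))
print(inproject_query("United States"))
print(inproject_query("United States/WikiProject Louisville"))
```

The three calls print inproject:Biography, inproject:"United States", and inproject:"United States/WikiProject Louisville", matching the forms reported in the comment above.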