Add hasrecommendation: search keyword
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	Tgr
	Dec 5 2020, 7:14 AM

Description

As part of the Add Links work (T268803: Add a link engineering: Search pipeline), we need a search keyword for restricting search results to pages which have link recommendations prepared for them.

Ideally this should be reusable for other recommendation types; hasrecommendation:link seems like the straightforward way to do it, in line with the existing hastemplate: keyword.
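
For illustration, a client could use such a keyword through the standard MediaWiki search API (action=query, list=search). This is only a sketch of how that might look: the helper name build_search_url is hypothetical, and nothing here reflects how GrowthExperiments actually issues its queries.

```python
from urllib.parse import urlencode


def build_search_url(api_base: str, recommendation_type: str, limit: int = 10) -> str:
    """Build a list=search API URL filtered by hasrecommendation:<type>.

    `build_search_url` is a hypothetical helper, not part of any extension.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        # The keyword proposed in this task, e.g. hasrecommendation:link
        "srsearch": f"hasrecommendation:{recommendation_type}",
        "srnamespace": "0",
        "srlimit": str(limit),
    }
    return f"{api_base}?{urlencode(params)}"


url = build_search_url("https://test.wikipedia.org/w/api.php", "link")
print(url)
```

Other recommendation types would only change the value after the colon, which is the reusability the description asks for.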

Details

Subject	Repo	Branch	Lines +/-
Fix string weight handling in CirrusSearch::updateWeightedTags	mediawiki/extensions/CirrusSearch	master	+5 -3
Validate DataSender::sendUpdateWeightedTags() better	mediawiki/extensions/CirrusSearch	master	+110 -5
hasrecommendation: Write and query BC field as well	mediawiki/extensions/CirrusSearch	master	+31 -9
Fix topic scores in importOresTopics.php	mediawiki/extensions/GrowthExperiments	master	+13 -4
Add hasrecommendation keyword to search filters	mediawiki/extensions/CirrusSearch	master	+157 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		MMiller_WMF	T252822 [EPIC] Growth: "add a link" structured task 1.0
		Resolved		• Zbyszko	T269493 Add hasrecommendation: search keyword

Event Timeline

Tgr created this task. Dec 5 2020, 7:14 AM

Restricted Application added a project: Discovery-Search. Dec 5 2020, 7:14 AM

Restricted Application added a subscriber: Aklapper.

There wasn't much discussion on whether this should be strictly filtering or filtering+ranking like articletopic: does (based on some kind of recommendation confidence score). The latter seems reasonable (and in that case maybe the naming should be different) but I don't think we have any actual use case for it - the new keyword would be used for suggesting tasks to users, and that needs to be done in a non-deterministic way to avoid collisions.

Tgr mentioned this in T268803: Add a link engineering: Search pipeline. Dec 5 2020, 7:18 AM

Tgr updated the task description.

CBogen moved this task from needs triage to Current work on the Discovery-Search board. Dec 7 2020, 4:24 PM

CBogen edited projects, added Discovery-Search (Current work); removed Discovery-Search.

CBogen set the point value for this task to 5. Dec 7 2020, 6:18 PM

CBogen moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Gehel added a parent task: T252822: [EPIC] Growth: "add a link" structured task 1.0. Dec 7 2020, 6:19 PM

There wasn't much discussion on whether this should be strictly filtering or filtering+ranking

Filtering only seems the correct way forward here; we don't really have any information available to help decide which things are going to be better. Additionally, my understanding is that the use case will involve the random sort, short-circuiting any scoring done by the keyword.

We could craft a confidence score (we have scores for the individual links). But as you say we don't really have a use case for it.

CBogen added a project: Image-Suggestions. Dec 9 2020, 7:59 PM

In T269493#6678417, @Tgr wrote:

We could craft a confidence score (we have scores for the individual links). But as you say we don't really have a use case for it.

For image recommendations the Structured Data team does have a use case for confidence scores, in case that future work should factor into anything here.

(For more information see the "What parameters do we need to be able to tune for image matches?" row here.)

I'm pretty sure that doesn't apply for this flag/this part of the pipeline at all but I wanted to mention it just in case.

CBogen moved this task from To Do to Search Pipeline on the Image-Suggestions board. Dec 10 2020, 3:42 PM

In T269493#6680764, @CBogen wrote:

I'm pretty sure that doesn't apply for this flag/this part of the pipeline at all but I wanted to mention it just in case.

It applies if you want to be able to filter by score.

You could say "only recommendations with a score >= X are valid", where X might be configurable per wiki; that could then be handled at the service level, and the search index wouldn't need to know about it.
Or the score could be returned as part of the recommendation data and the client could decide what to do with it; that does not require search integration either.

However, if clients need to be able to ask for different score ranges (e.g. there should be a way of calling the API that returns recommendations with 0.5+ score and another one that only returns ones with 0.9+ score), or the API needs to rank by the score (return the highest-scoring results first), then the score needs to be in the search index.

AIUI the plan is to make the recommendation field a text field, containing special words like type_image and then doing a fulltext search within that field for those words. That's relatively easy to extend to ranking (just add those words multiple times, with the number of occurrences corresponding to the score, and the standard word frequency weighting mechanics in Elasticsearch will rank up results with a higher score). I'm not sure if it's easy to extend to filtering to specific score ranges, though. And ranking is not ideal for the task queue use case because we don't want everyone to get the same search results (although that depends on how individualized the searches done by the bot operators are otherwise).
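
To make the "add those words multiple times" idea concrete, here is a rough sketch of encoding a score as a repeat count. The function names and the max_repeats granularity are made up for illustration; this is not what CirrusSearch does, just the mechanism described above in miniature.

```python
def encode_score_as_repeats(token: str, score: float, max_repeats: int = 10) -> str:
    """Repeat `token` in proportion to `score` (0.0-1.0), at least once.

    Hypothetical helper: a higher score yields a higher term frequency,
    which frequency-based ranking would then favor.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    repeats = max(1, round(score * max_repeats))
    return " ".join([token] * repeats)


def term_frequency(field_text: str, token: str) -> int:
    """Count whitespace-separated occurrences of `token` in the field text."""
    return field_text.split().count(token)


high = encode_score_as_repeats("type_image", 0.9)
low = encode_score_as_repeats("type_image", 0.2)
print(term_frequency(high, "type_image"), term_frequency(low, "type_image"))
```

Note that this only supports ranking; recovering an exact score range from a repeat count is the part that is not obviously easy.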

In T269493#6683650, @Tgr wrote:

However, if clients need to be able to ask for different score ranges (e.g. there should be a way of calling the API that returns recommendations with 0.5+ score and another one that only returns ones with 0.9+ score), or the API needs to rank by the score (return the highest-scoring results first), then the score needs to be in the search index.

AIUI the plan is to make the recommendation field a text field, containing special words like type_image and then doing a fulltext search within that field for those words. That's relatively easy to extend to ranking (just add those words multiple times, with the number of occurrences corresponding to the score, and the standard word frequency weighting mechanics in Elasticsearch will rank up results with a higher score). I'm not sure if it's easy to extend to filtering to specific score ranges, though. And ranking is not ideal for the task queue use case because we don't want everyone to get the same search results (although that depends on how individualized the searches done by the bot operators are otherwise).

Structured data is definitely hoping to be able to allow bots to determine their own confidence score threshold for image recs, so the ability to ask for different score ranges is desired if it's doable.

Tagging @Ramsey-WMF as an FYI.

The analysis chain and field here are the same as those used for the ORES models. As such, the field can accept an additional integer value in the range 1-1000 as its confidence score. Today, since no additional information is provided, this is set to a constant value of 1 for all pages. If additional information were included in the events, it could be added. These confidence scores can be range-filtered with a custom search query that dcausse previously added to our Elasticsearch plugin when creating these fields.
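
As a rough sketch of the tag format this implies (prefix/name|weight, with an integer weight from 1 to 1000 and a default of 1), something like the following could build and check tag strings. The function names are hypothetical, and the range check is a plain Python stand-in, not the custom query from the Elasticsearch plugin.

```python
def format_weighted_tag(prefix: str, name: str, weight: int = 1) -> str:
    """Format a tag as '<prefix>/<name>|<weight>' with an integer weight in 1-1000."""
    if isinstance(weight, bool) or not isinstance(weight, int):
        raise TypeError("weight must be an integer")
    if not 1 <= weight <= 1000:
        raise ValueError("weight must be between 1 and 1000")
    return f"{prefix}/{name}|{weight}"


def weight_in_range(tag: str, low: int, high: int) -> bool:
    """Client-side stand-in for a range filter on the weight part of a tag."""
    weight = int(tag.rsplit("|", 1)[1])
    return low <= weight <= high


link_tag = format_weighted_tag("recommendation.link", "exists")
print(link_tag)
```

With a constant weight of 1, every tagged page passes any range that includes 1, which matches the plain-filter use case in this ticket.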

For this ticket it seems the keyword use case is still as a plain filter.

In T269493#6684975, @EBernhardson wrote:

The analysis chain and field here are the same as those used for the ORES models. As such, the field can accept an additional integer value in the range 1-1000 as its confidence score. Today, since no additional information is provided, this is set to a constant value of 1 for all pages. If additional information were included in the events, it could be added. These confidence scores can be range-filtered with a custom search query that dcausse previously added to our Elasticsearch plugin when creating these fields.

For this ticket it seems the keyword use case is still as a plain filter.

Sounds good to me!

In T269493#6684975, @EBernhardson wrote:

For this ticket it seems the keyword use case is still as a plain filter.

Yes, for link recommendations we only need a boolean filter. I just wanted to make sure we don't run into problems in the long term.

CBogen moved this task from Search Pipeline to To Do on the Image-Suggestions board. Dec 16 2020, 7:30 PM

kostajh added a project: Add-Link. Jan 5 2021, 8:48 PM

• Zbyszko claimed this task. Jan 6 2021, 8:24 AM

• Zbyszko moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Change 655083 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[mediawiki/extensions/CirrusSearch@master] Add hasrecommendation keyword to search filters

https://gerrit.wikimedia.org/r/655083

gerritbot added a project: Patch-For-Review. Jan 8 2021, 3:01 PM

• Zbyszko moved this task from In Progress to Waiting on the Discovery-Search (Current work) board. Jan 12 2021, 1:48 PM

• Zbyszko moved this task from Waiting to In Progress on the Discovery-Search (Current work) board. Jan 12 2021, 4:36 PM

• Zbyszko moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Jan 15 2021, 12:46 PM

kostajh moved this task from Inbox to Sprint 0 (Growth Team) on the Growth-Team board. Jan 17 2021, 8:44 PM

kostajh edited projects, added Growth-Team (Sprint 0 (Growth Team)); removed Growth-Team.

kostajh mentioned this in T265894: Add Link engineering: Local environment setup. Jan 18 2021, 2:46 PM

kostajh mentioned this in T272304: Maintenance script for setting field data. Jan 18 2021, 2:50 PM

Change 655083 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add hasrecommendation keyword to search filters

https://gerrit.wikimedia.org/r/655083

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.30; 2021-02-09). Feb 8 2021, 2:00 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 8 2021, 2:10 PM

EBernhardson moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Feb 8 2021, 11:30 PM

kostajh awarded a token. Feb 9 2021, 8:06 PM

EBernhardson moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Feb 24 2021, 9:32 PM

Thank you @Zbyszko and Discovery-Search team!

This does not seem to be working. Here are a few pages for which GrowthExperiments generated recommendations:
https://test.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusdoc&pageids=114619%7C114368%7C114510%7C113318

weighted_tags looks as expected, e.g.

"weighted_tags": [
                                "classification.ores.articletopic/Culture.Biography.Biography*|0.6265578266014",
                                "classification.ores.articletopic/Geography.Regions.Americas.North America|0.78969229536699",
                                "classification.ores.articletopic/History and Society.Politics and government|0.85337919734571",
                                "recommendation.link/exists|1"
                            ]

but a search for hasrecommendation:link finds nothing:
https://test.wikipedia.org/w/index.php?search=hasrecommendation%3Alink&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

cirrusDumpQuery says the generated query is

"query": {
    "bool": {
        "must": [
            {
                "match_all": {}
            }
        ],
        "filter": [
            {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "weighted_tags": {
                                    "query": "recommendation.link\/exists"
                                }
                            }
                        },
                        {
                            "terms": {
                                "namespace": [
                                    0
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    }
},

which, to my untrained-in-Elasticsearch eyes, looks correct. Also, the search keyword works fine on my local setup.

It also works fine on beta, although we have different problems there, which might or might not be search infrastructure related: T277208: Add Link: refreshLinkRecommendations.php does not write to the search index on beta

Tgr mentioned this in T277173: Deploy Add Link on testwiki. Mar 12 2021, 12:04 PM

Gehel moved this task from Needs Reporting to Incoming on the Discovery-Search (Current work) board. Mar 12 2021, 12:24 PM

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

In T269493#6908767, @EBernhardson wrote:

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

Actually that's not it. @dcausse double-checked, and foo.bar/baz|0.1234 will be interpreted as if foo.bar/baz|0.1234|1 had been provided; essentially, it can't parse the value after | as an int, so it assumes that value is part of the source.
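
The fallback described here can be mimicked in a few lines. This is only an illustration of the described behavior, not the actual analysis chain in the plugin; parse_weighted_tag and its default weight of 1 are assumptions based on the comment above.

```python
import re
from typing import Tuple


def parse_weighted_tag(raw: str, default_weight: int = 1) -> Tuple[str, int]:
    """Split 'tag|weight' into (tag, weight).

    If the part after the last '|' is not a plain integer, the whole
    string is kept as the tag and the default weight is used, mirroring
    the fallback described in the comment above.
    """
    head, sep, tail = raw.rpartition("|")
    if sep and re.fullmatch(r"[0-9]+", tail):
        return head, int(tail)
    return raw, default_weight


print(parse_weighted_tag("foo.bar/baz|0.1234"))
print(parse_weighted_tag("recommendation.link/exists|1"))
```

Under this reading, a tag with a float weight ends up indexed under a name nobody searches for, which is consistent with such predictions being unfindable but otherwise harmless.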

The actual problem here is that we've only run the reindex procedure against the beta cluster; prod will be coming up soon-ish. The analytics process is currently loading data into both ores_articletopics and weighted_tags using the same formatted data until the switchover is complete (Elasticsearch allows writing to fields that don't exist; they are simply stored in the source document and used during the reindex that enables the field).

In T269493#6908767, @EBernhardson wrote:

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

That's my fault, I used a script to import production ORES scores for a bunch of testwiki articles, but forgot to scale up the scores. I can fix the index data if it's worth the effort.
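
For reference, scaling a 0-1 ORES probability into the 1-1000 integer range could look like the sketch below. The rounding and clamping choices are assumptions; the actual fix in importOresTopics.php may differ.

```python
def scale_score(score: float) -> int:
    """Scale a 0.0-1.0 probability to an integer weight in 1-1000.

    Rounding to the nearest integer and clamping to a minimum of 1 are
    assumptions made for this sketch.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return max(1, min(1000, round(score * 1000)))


# Scores taken from the weighted_tags sample earlier in this task.
for raw in (0.6265578266014, 0.78969229536699, 0.85337919734571):
    print(raw, "->", scale_score(raw))
```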

Change 671214 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] hasrecommendation: Write and query BC field as well

https://gerrit.wikimedia.org/r/671214

gerritbot added a project: Patch-For-Review. Mar 12 2021, 6:34 PM

In T269493#6908928, @Tgr wrote:

In T269493#6908767, @EBernhardson wrote:

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

That's my fault, I used a script to import production ORES scores for a bunch of testwiki articles, but forgot to scale up the scores. I can fix the index data if it's worth the effort.

If the effort is minimal it would be a bit cleaner, but on review it looks like these shouldn't cause any problems beyond the predictions being unfindable.

For the problem at hand, we can either reindex testwiki today and wait for the full cluster reindex, or put a little BC code in cirrus to update/query the BC field. Which makes the most sense, I suppose, depends on how quickly it needs to work outside testwiki. The BC patch wasn't too difficult so I put one together, but we can reindex testwiki at any time.
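
One way to picture "query the BC field as well" is a filter that matches the tag in either field. The sketch below builds such a filter as a plain dict; the field names come from this task, but the query shape is a guess modeled on the cirrusDumpQuery output above, and build_bc_tag_filter is a hypothetical name. The actual patch may do this differently.

```python
from typing import Dict, Sequence


def build_bc_tag_filter(
    tag: str,
    fields: Sequence[str] = ("weighted_tags", "ores_articletopics"),
) -> Dict:
    """Build a bool filter that matches `tag` in any of `fields`.

    The first field is the new one, the second the backwards-compatible
    one; both names are taken from the discussion in this task.
    """
    return {
        "bool": {
            "should": [{"match": {field: {"query": tag}}} for field in fields],
            "minimum_should_match": 1,
        }
    }


bc_filter = build_bc_tag_filter("recommendation.link/exists")
print(bc_filter)
```

Once every cluster has been reindexed, the second field could be dropped from the tuple and the filter collapses back to a single match.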

In T269493#6909430, @EBernhardson wrote:

If the effort is minimal it would be a bit cleaner, but on review it looks like these shouldn't cause any problems beyond the predictions being unfindable.

It's not that minimal, we'd have to iterate through all pages and check their tags, as I didn't keep the logs and I don't think there's a way to find the affected pages via search. But it's not that huge either, and one should fix what one breaks so if you think it's problematic let me know.

For the problem at hand, we can either reindex testwiki today and wait for the full cluster reindex, or put a little BC code in cirrus to update/query the BC field. Which makes the most sense, I suppose, depends on how quickly it needs to work outside testwiki. The BC patch wasn't too difficult so I put one together, but we can reindex testwiki at any time.

Thanks for the quick fix! We won't need this outside testwiki for several weeks.

Mentioned in SAL (#wikimedia-operations) [2021-03-12T19:47:04Z] <ebernhardson> start in-place reindex testwiki in eqiad, codfw, cloudelastic cirrus clusters for T269493

MPhamWMF moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Mar 15 2021, 3:46 PM

For reference BC patch by Erik: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/671214

Change 672447 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Fix topic scores in importOresTopics.php

https://gerrit.wikimedia.org/r/672447

Change 672536 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] [WIP] Validate DataSender::sendUpdateWeightedTags() better

https://gerrit.wikimedia.org/r/672536

Change 672447 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Fix topic scores in importOresTopics.php

https://gerrit.wikimedia.org/r/672447

ReleaseTaggerBot edited projects, added MW-1.36-notes (1.36.0-wmf.35; 2021-03-16); removed MW-1.36-notes (1.36.0-wmf.30; 2021-02-09). Mar 15 2021, 10:00 PM

Change 671214 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] hasrecommendation: Write and query BC field as well

https://gerrit.wikimedia.org/r/671214

ReleaseTaggerBot edited projects, added MW-1.36-notes (1.36.0-wmf.36; 2021-03-23); removed MW-1.36-notes (1.36.0-wmf.35; 2021-03-16). Mar 16 2021, 4:00 PM

• Zbyszko moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Mar 17 2021, 10:01 AM

Change 672536 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Validate DataSender::sendUpdateWeightedTags() better

https://gerrit.wikimedia.org/r/672536

Maintenance_bot removed a project: Patch-For-Review. Mar 19 2021, 6:11 PM

Change 673732 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] Fix string weight handling in CirrusSearch::updateWeightedTags

https://gerrit.wikimedia.org/r/673732

gerritbot added a project: Patch-For-Review. Mar 21 2021, 9:36 AM

Change 673732 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Fix string weight handling in CirrusSearch::updateWeightedTags

https://gerrit.wikimedia.org/r/673732

Maintenance_bot removed a project: Patch-For-Review. Mar 22 2021, 9:10 AM

kostajh moved this task from Backlog to Done / QA on the Add-Link board. Mar 22 2021, 10:46 AM

EBernhardson moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Mar 23 2021, 6:58 PM

Gehel closed this task as Resolved. Mar 24 2021, 1:25 PM
