[SPIKE] Investigate search index delta variations
Closed, Resolved, Public

Description

T372912: Migrate image recommendation to use page_weighted_tags_changed stream will change how image suggestions send updates to search indices. In T372912#10336830 we discussed ways to optimize them.

The main question here is to understand delta variations, which usually impact Commons. See T372912#10336730.

NOTE: we compute scores for Commons weighted tags. The intuition is that big updates happen when lots of scores are different. How important are those scores for real-world search queries? The easiest optimization could just be to drop them if they're not used.
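To make the intuition in the note concrete, here is a minimal sketch of measuring how much of a tag snapshot changed between two runs. Everything in it is an assumption for illustration: the snapshot shape (page ID mapped to a set of tag strings), the function names, and the threshold semantics are hypothetical and are not taken from the actual pipeline.

```python
from typing import Dict, Set

# Hypothetical shape: page ID -> set of "prefix/QID|SCORE" weighted tags.
Snapshot = Dict[int, Set[str]]


def changed_pages(previous: Snapshot, current: Snapshot) -> Set[int]:
    """Return IDs of pages whose tag set differs between two snapshots."""
    all_pages = previous.keys() | current.keys()
    return {
        page
        for page in all_pages
        if previous.get(page, set()) != current.get(page, set())
    }


def exceeds_threshold(previous: Snapshot, current: Snapshot, threshold: float) -> bool:
    """True when the share of changed pages is strictly above `threshold` (0..1)."""
    total = len(previous.keys() | current.keys())
    if total == 0:
        return False
    return len(changed_pages(previous, current)) / total > threshold
```

Because the score is part of the tag string, a page counts as changed as soon as one score moves, which is why many small score shifts can add up to a big update.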

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
add Commons delta threshold | repos/data-engineering/airflow-dags!974 | mfossati | T380389 | main

Event Timeline

mfossati updated the task description.

Filter out Commons while we figure out the importance of its weighted tags.

mfossati changed the task status from Open to In Progress. Dec 11 2024, 10:52 AM
mfossati claimed this task.

Moving to code review, but looking at Commons weighted tags usage in the meantime.

We're focusing here on the following weighted tags that go to the Commons search index:

  • image.linked.from.wikidata.p18/QID|SCORE
  • image.linked.from.wikidata.p373/QID|SCORE
  • image.linked.from.wikipedia.lead_image/QID|SCORE

where QID is a Wikidata item and SCORE is computed in commonswiki_file.py.
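For reference, a tag in the PREFIX/QID|SCORE shape above could be split into its parts roughly as follows. This is only a sketch: the `WeightedTag` type and `parse_weighted_tag` helper are hypothetical names, and treating SCORE as an integer is an assumption, not something confirmed by commonswiki_file.py.

```python
from typing import NamedTuple


class WeightedTag(NamedTuple):
    prefix: str  # e.g. image.linked.from.wikidata.p18
    qid: str     # Wikidata item ID, e.g. Q483407
    score: int   # assumption: scores are integers


def parse_weighted_tag(tag: str) -> WeightedTag:
    """Split 'PREFIX/QID|SCORE' into its three components."""
    prefix, _, rest = tag.partition("/")
    qid, _, score = rest.partition("|")
    if not prefix or not qid.startswith("Q") or not score:
        raise ValueError(f"not a PREFIX/QID|SCORE tag: {tag!r}")
    return WeightedTag(prefix, qid, int(score))
```

For example, `parse_weighted_tag("image.linked.from.wikidata.p18/Q483407|500")` gives a prefix of `image.linked.from.wikidata.p18`, a QID of `Q483407`, and a score of 500 (the score value here is made up for illustration).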

These tags seem to be used in 2 ways:

  1. implicitly - when searching the File namespace on any wiki with the WikibaseMediaInfo extension enabled, which currently should mean MediaSearch or Special:Search on Commons
  2. explicitly - via custommatch:depicts_or_linked_from=QID in a Commons search query

Example:

...

{
	"match": {
		"weighted_tags": {
			"query": "image.linked.from.wikidata.p18\/Q483407",
			"boost": 2.342153943085914
		}
	}
},
{
	"match": {
		"weighted_tags": {
			"query": "image.linked.from.wikidata.p373\/Q483407",
			"boost": 4.386547106798469
		}
	}
},
{
	"match": {
		"weighted_tags": {
			"query": "image.linked.from.wikipedia.lead_image\/Q483407",
			"boost": 4.795040181639707
		}
	}
}

...
  • explicit - search query = custommatch:depicts_or_linked_from=Q483407, where Q483407 is the Wikidata item for Ramones.
...

{
	"match": {
		"weighted_tags": {
			"query": "image.linked.from.wikidata.p18\/Q483407",
			"boost": 1.984794590275781
		}
	}
},
{
	"match": {
		"weighted_tags": {
			"query": "image.linked.from.wikidata.p373\/Q483407",
			"boost": 5.739424158364518
		}
	}
},
{
	"match": {
		"weighted_tags": {
			"query": "image.linked.from.wikipedia.lead_image\/Q483407",
			"boost": 3.74983393205065
		}
	}
}


...

Note that these queries yield slightly different results.
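Both query fragments above share one shape and differ only in the boost values. As a rough sketch, the three match clauses can be generated from a QID and a prefix-to-boost mapping. The `build_match_clauses` helper is a hypothetical name for illustration, not the function that actually builds these queries; the boosts are copied from the explicit example.

```python
import json
from typing import Dict, List


def build_match_clauses(qid: str, boosts: Dict[str, float]) -> List[dict]:
    """Build one weighted_tags match clause per tag prefix."""
    return [
        {"match": {"weighted_tags": {"query": f"{prefix}/{qid}", "boost": boost}}}
        for prefix, boost in boosts.items()
    ]


# Boost values taken from the explicit example (custommatch:depicts_or_linked_from=Q483407).
explicit_boosts = {
    "image.linked.from.wikidata.p18": 1.984794590275781,
    "image.linked.from.wikidata.p373": 5.739424158364518,
    "image.linked.from.wikipedia.lead_image": 3.74983393205065,
}

clauses = build_match_clauses("Q483407", explicit_boosts)
print(json.dumps(clauses, indent=2))
```

Python's `json.dumps` writes the slash as `/` rather than `\/`; the two spellings are equivalent in JSON.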

Almost no search queries on Commons contain custommatch:depicts_or_linked_from:

from wmfdata.spark import create_session


def collect_searches(spark):
    """Return Commons search requests issued via Special:Search or Special:MediaSearch."""
    initial_query = """SELECT http, params
    FROM event.mediawiki_cirrussearch_request
    WHERE database='commonswiki' AND params IS NOT NULL
    """
    ddf = spark.sql(initial_query)
    filtered = (
        ddf
        # Keep requests whose title parameter points at one of the two search pages
        .where(
            ddf.params.title.contains('Special:Search') | ddf.params.title.contains('Special:MediaSearch')
        )
        # Keep requests whose referer contains index.php
        .where(
            ddf.http.request_headers.referer.contains('index.php')
        )
    )

    return filtered


spark = create_session(app_name='commons-weighted-tags', type='yarn-large')
ddf = collect_searches(spark)
# Count the searches that use the explicit keyword
kw = 'custommatch:depicts_or_linked_from='
wt = ddf.where(ddf.params.search.contains(kw))
wt.count()

4
mfossati added a subscriber: Sneha.

Moving to Needs Design: this requires a discussion with engineers.
FYI @Sneha you can safely ignore this ticket.