
Reduce Commons' search index delta size
Closed, Resolved · Public

Description

In T372912#10336684, the Search team highlighted that Commons search index updates can be ~40x bigger than usual and asked whether this can be understood and optimized.
The intuition is that big updates happen when lots of weighted tag scores are different.

Possible solutions:

  • compute the delta after stripping the score from weighted tags
  • round the score at compute time to reduce variation, e.g., round(x / 10) * 10 or (1 + floor(x / 10)) * 10, leaving the delta computation itself intact; we decided to opt for this one (see the sketch after this list)
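
For reference, a minimal sketch of the two candidate rounding formulas from the list above (Python; the function names are illustrative, not taken from the pipeline code):

import math

def round_to_nearest_ten(score: float) -> int:
    # round(x / 10) * 10: snap the score to the nearest multiple of 10,
    # so small week-over-week fluctuations no longer change the stored value.
    return round(score / 10) * 10

def round_up_to_next_ten(score: float) -> int:
    # (1 + floor(x / 10)) * 10: always round up to the next multiple of 10,
    # which keeps low but non-zero scores from collapsing to 0.
    return (1 + math.floor(score / 10)) * 10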

Then, talk to Search to ingest the initial big delta.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
new lead image score rounding formula | repos/structured-data/image-suggestions!53 | mfossati | improve-rounding | main
reduce the variation of relevance scores | repos/structured-data/image-suggestions!50 | mfossati | T384514 | main

Event Timeline

mfossati changed the task status from Open to In Progress. Feb 18 2025, 11:17 AM
mfossati claimed this task.

We ended up implementing a rounded log instead. It's not as efficient as the formula we were initially considering, but it's much better at keeping the relevant nuance in the scores. If it turns out not to compress the delta enough, we can simply increase the log base to reduce the spread.
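
For illustration, a minimal sketch of what a rounded-log bucketing could look like (the function name and the default base are assumptions; the formula actually shipped is in the "new lead image score rounding formula" merge request listed above):

import math

def rounded_log_score(score: float, base: float = 10.0) -> float:
    # Map a raw score onto a coarse logarithmic scale by rounding its
    # logarithm and exponentiating back. Nearby raw scores fall into the
    # same bucket, so week-over-week deltas shrink, while large differences
    # stay distinguishable. A bigger base means wider buckets, i.e. fewer
    # distinct values and smaller deltas.
    if score <= 0:
        return 0.0
    return base ** round(math.log(score, base))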

mfossati added a subscriber: dcausse.

Commons delta row counts of last two scheduled runs:

select count(*) from analytics_platform_eng.image_suggestions_search_index_delta where snapshot='2025-03-24' and wikiid='commonswiki';
112205

select count(*) from analytics_platform_eng.image_suggestions_search_index_delta where snapshot='2025-03-17' and wikiid='commonswiki';
122904

CC @dcausse.

Looks good, closing.

@mfossati thanks! Do these two snapshots use the same dumps? If yes, we might perhaps wait for a run that uses a different dump and see?

Additionally, if we're confident that we'll always have fewer than 500k tags/week, we might consider unblocking T372912: Migrate image recommendation to use page_weighted_tags_changed stream.

@mfossati thanks! Do these two snapshots use the same dumps?

Yes, they both use the usual inputs. FYI @Cparle is working towards consuming new weekly inputs in T389516: [L] Update ALIS to use wmf_content.mediawiki_content_history_v1 instead of wmf.*.

If yes, we might perhaps wait for a run that uses a different dump and see?

Sounds good.

Additionally, if we're confident that we'll always have fewer than 500k tags/week, we might consider unblocking T372912: Migrate image recommendation to use page_weighted_tags_changed stream.

I suggest letting the new logic run for a while and keeping an eye on the numbers before confirming.