Use image popularity as a ranking signal for search on commons
Closed, Declined · Public

Description

The number of HTTP requests for each media file is stored in the data lake. We could use these counts as a popularity score for images, and use that score as a ranking signal for image search.

See the Oozie jobs that are used to import the data into Cassandra for the REST API: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/cassandra/daily

I'd hope we could do something similar to import the data into the commons elasticsearch index, and then use it as a ranking signal.

Event Timeline

Popularity information is currently shipped from analytics to production elasticsearch with an airflow job: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/dags/transfer_to_es.py

Today the popularity score is a percentage of on-wiki page views as reported by wmf.pageview_hourly. It is calculated in https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/dags/popularity_score.py

The popularity score calculation probably needs to be adjusted to compute a separate popularity score for commons from the alternate data source, drop commons from the data sourced from wmf.pageview_hourly, and then union the two sets together.
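
To make the proposed adjustment concrete, here is a rough sketch in plain Python. It is not the code in popularity_score.py. The function names, the input shapes, the `commonswiki` identifier, and the choice to normalize each score against the per-wiki total are all assumptions made for illustration.

```python
from collections import defaultdict


def popularity_scores(view_counts):
    """Turn {(wiki, title): count} into {(wiki, title): share of that wiki's total}.

    Normalizing per wiki is an assumption; the real job may normalize differently.
    """
    totals = defaultdict(int)
    for (wiki, _title), count in view_counts.items():
        totals[wiki] += count
    return {
        key: count / totals[key[0]]
        for key, count in view_counts.items()
        if totals[key[0]] > 0
    }


def combined_popularity(pageviews, media_requests, commons="commonswiki"):
    """Hypothetical helper: score commons from media requests, everything else from page views.

    pageviews:      {(wiki, title): views}, standing in for wmf.pageview_hourly
    media_requests: {file_title: requests}, standing in for the alternate data source
    """
    # Drop commons from the page-view-derived data.
    non_commons = {k: v for k, v in pageviews.items() if k[0] != commons}
    # Score commons separately from the media request counts.
    commons_only = {(commons, title): n for title, n in media_requests.items()}
    # Union the two sets; the keys are disjoint, so nothing is overwritten.
    scores = popularity_scores(non_commons)
    scores.update(popularity_scores(commons_only))
    return scores
```

In the real pipeline this would presumably be done as a query over the two tables rather than over in-memory dictionaries, but the three steps (separate score, drop, union) would be the same.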

Makes sense to me ... would I be right in saying that this seems more like something you guys might implement rather than us?

Yes this is something we can take care of. Is there a timeline?

It's part of the SDAW work. No deadline atm, though it'd be good to have it ready for when we convert the mediasearch to use elastic properly (we expect to have that a couple of weeks from now).

dcausse triaged this task as Medium priority. · Jun 22 2020, 7:20 AM
dcausse moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

Various groundwork has been laid for this that should allow progress to start soon-ish.

Waiting on our analytics<->elasticsearch pipeline to support non-content namespaces. That in turn is waiting on a cluster restart and on testing the new support when shipping ORES drafttopic predictions.

CBogen subscribed.

The Structured Data team has decided to nix the popularity filter because it isn't that useful, and has decided that the current information in incoming_links is enough for ranking, so we are closing this task.