Use image popularity as a ranking signal for search on commons
Closed, Declined, Public

Assigned To

None

Authored By

	Cparle
	May 29 2020, 10:32 AM

Description

The number of HTTP requests for each media file is stored in the data lake. We could use these counts as a popularity score for images, and use that score as a ranking signal for image search.

See here for the Oozie jobs that are used to import the data into Cassandra for the REST API: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/cassandra/daily

I'd hope we could do something similar to import the data into the Commons Elasticsearch index, and then use it as a ranking signal.
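As a rough illustration of what "use it as a ranking signal" could look like once the score is in the index, here is a minimal sketch that builds an Elasticsearch `function_score` query body. The field names `text` and `popularity_score`, the `log1p` modifier, and the helper name `popularity_boosted_query` are assumptions made for this example, not something this task specifies.

```python
def popularity_boosted_query(text, field="popularity_score", weight=1.0):
    """Build a search body that adds a popularity boost to the text match score.

    Hypothetical helper: the indexed field names are assumptions.
    """
    return {
        "query": {
            "function_score": {
                # Plain full-text match on an assumed "text" field.
                "query": {"match": {"text": text}},
                "functions": [
                    {
                        # Read the per-document popularity value; documents
                        # without the field are treated as 0.
                        "field_value_factor": {
                            "field": field,
                            "modifier": "log1p",
                            "missing": 0,
                        },
                        "weight": weight,
                    }
                ],
                # Add the boost to the relevance score rather than multiply it.
                "boost_mode": "sum",
            }
        }
    }


body = popularity_boosted_query("red panda", weight=2.0)
```

The function only returns a dict, so the body can be inspected or tuned before it is sent to a cluster.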

Event Timeline

Cparle created this task. May 29 2020, 10:32 AM

Restricted Application added a project: Structured-Data-Backlog. May 29 2020, 10:32 AM

Restricted Application added a subscriber: Aklapper.

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog. May 29 2020, 2:33 PM

Popularity information is currently shipped from analytics to production Elasticsearch by an Airflow job: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/dags/transfer_to_es.py

The popularity score today is a percentage of on-wiki page views, as reported by wmf.pageview_hourly. It is calculated in https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/dags/popularity_score.py

The popularity score calculation probably needs to be adjusted: calculate a separate popularity score for Commons using the alternate data source, drop Commons from the data sourced from wmf.pageview_hourly, and then union the two sets together.
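The adjustment described above can be sketched in plain Python. This is only an illustration under stated assumptions: the real job is an Airflow DAG, and the input shapes, the `commonswiki` key, and the function names here are hypothetical. The score is computed as a fraction of each wiki's total views rather than a literal percentage.

```python
from collections import defaultdict


def popularity_scores(view_counts):
    """Map (wiki, page) -> that page's share of its wiki's total views."""
    totals = defaultdict(int)
    for (wiki, _page), views in view_counts.items():
        totals[wiki] += views
    return {
        key: views / totals[key[0]]
        for key, views in view_counts.items()
        if totals[key[0]] > 0
    }


def combined_scores(pageviews, media_requests, commons="commonswiki"):
    """Union pageview-based scores (minus Commons) with media-request scores.

    pageviews:      {(wiki, page): views}, standing in for wmf.pageview_hourly
    media_requests: {file_title: requests}, standing in for the alternate source
    """
    # Drop Commons from the data sourced from page views.
    non_commons = {k: v for k, v in pageviews.items() if k[0] != commons}
    # Score Commons separately from media request counts.
    commons_only = {(commons, title): n for title, n in media_requests.items()}
    # Union the two sets; the keys cannot collide because Commons was dropped.
    scores = popularity_scores(non_commons)
    scores.update(popularity_scores(commons_only))
    return scores


pageviews = {
    ("enwiki", "A"): 30,
    ("enwiki", "B"): 10,
    ("commonswiki", "File:X.jpg"): 5,
}
media_requests = {"File:X.jpg": 3, "File:Y.jpg": 1}
scores = combined_scores(pageviews, media_requests)
```

In this toy input the Commons page-view row is ignored, and `File:X.jpg` is scored only from its media request count.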

Makes sense to me ... would I be right in saying that this seems more like something you guys might implement rather than us?

Yes, this is something we can take care of. Is there a timeline?

It's part of the SDAW work. No deadline atm, though it'd be good to have it ready for when we convert MediaSearch to use Elasticsearch properly (we expect that a couple of weeks from now).

Cparle edited projects, added Discovery-Search; removed Structured-Data-Backlog (Current Work). Jun 3 2020, 6:59 PM

CBogen added a project: Structured-Data-Backlog. Jun 3 2020, 7:09 PM

CBogen moved this task from Triage to Tracking on the Structured-Data-Backlog board.

dcausse triaged this task as Medium priority. Jun 22 2020, 7:20 AM

dcausse moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

Various groundwork has been laid for this that should allow progress to start soon-ish.

CBogen moved this task from ML & Data Pipeline to Current work on the Discovery-Search board. Aug 6 2020, 7:20 PM

CBogen edited projects, added Discovery-Search (Current work); removed Discovery-Search.

CBogen moved this task from Incoming to Waiting on the Discovery-Search (Current work) board. Aug 17 2020, 5:35 PM

Waiting on our analytics<->elasticsearch pipeline to support non-content namespaces. That in turn is waiting on a cluster restart and on testing the new support when shipping ORES drafttopic predictions.

EBernhardson moved this task from Waiting to Blocked/Waiting on the Discovery-Search (Current work) board. Sep 28 2020, 5:31 PM

The Structured Data team has decided to nix the popularity filter because it isn't that useful, and has decided that the current information in incoming_links is enough for ranking, so we are closing this task.
