Page MenuHomePhabricator

Create spreadsheet of last 90 days of Commons search queries
Closed, ResolvedPublic

Description

The Structured Data team is looking for a list of the last 90 days of Commons search queries so that we can manually review the data to understand what popular searches are on Commons and create "concept chips" of related searches.

The list should have the top 10,000 queries (full queries, not individual search terms) from the last 90 days. It's okay if it's not filtered otherwise.

We would also love to see the queries grouped/categorized. Erik mentioned that this was possible and that he'd look for the old code to do so.


Final report: https://superset.wikimedia.org/superset/dashboard/154/

Event Timeline

Change 614874 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] Add report on top N queries for given wiki

https://gerrit.wikimedia.org/r/614874

In order to deliver this report can we put these queries in a table in hadoop? That way the table can be sourced via presto on superset and any stakeholder that looks at the data can do so via superset ui. This is the best method to deliver reports to users that need to engage with the reports but not the raw data. A kerberos user and analytics-private-data permits (which are sensitive permits for users not familiar with the systems) should not be needed to access what is ,essentially, a report.

Once we have a table the only thing needed to see data in superset is (via presto and sql lab)

select * from table

This last SQL lab step, I promise, is painless.

I wasn't aware of the presto integration, that sounds useful. Yes I can push these reports into a hive table pretty easily. Thanks!

https://superset.wikimedia.org/superset/dashboard/154/

Report is currently static data, but once the above patch is merged it will self-update daily.

Change 614874 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Add report on top N queries for given wiki

https://gerrit.wikimedia.org/r/614874

Scheduling deployed, ran first two days, appears to be working. Updated linked superset dashboard to point to the official tables, at discovery.fulltext_head_queries.