
Review current search metrics for accuracy and documentation
Closed, Resolved · Public

Description

This task encompasses investigation into:

  • Whether search metrics we are tracking are getting the correct data.
  • Whether we are getting the proper data from Desktop and Mobile Web, and Android and iOS Apps

Issues and anomalies

These are the issues and anomalies with our dashboard that have been found so far. Some can be fixed immediately; for those that need further investigation, we will create new tickets.

Infrastructure

  • All instances in the shiny-r project need to be upgraded as soon as possible (T204688, T204505)
  • This one depends on T168967

Bugs in dashboard

  • On SRP visit time, the results for “How long Wikipedia searchers stay on the search result pages” look broken because the results for the three groups (English, French and Catalan, and Other) are all identical
  • Dwell time on visited pages stopped updating on 2017-08-27
  • On the metrics summary, adding an explanation of the user engagement calculation could be helpful. Same on KPI: User engagement
  • On KPI: User engagement, make sure the calculation is correct. Adding a user engagement graph by platform may also be helpful, since CTR on the various platforms seems very different
  • On desktop events, adding a CTR graph may be helpful
  • On PaulScore Approximations, the PaulScore for autocomplete searches is 0
  • On mobile app events, the Android event count has dropped since 4/20 and the iOS event count since 6/19. Their load times also increased around the same time. These are likely related to app EventLogging schema changes and need to be fixed and documented
  • Invoke Source on Mobile App stopped updating on 2017-10-04
  • On API calls by referrer class, we need to add an annotation about the UDF change on 2017-06-29
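For readers unfamiliar with the metric: PaulScore rewards clicks on highly ranked results. A minimal sketch of one common formulation (this is an illustration, not the dashboard's actual implementation; the scoring factor and 1-indexed positions are assumptions) also shows how missing click-position data would pin the metric at 0, as in the autocomplete bug above:

```python
# Hypothetical sketch of a PaulScore-style approximation. Assumes a
# scoring factor f in (0, 1) and 1-indexed click positions; both
# choices are assumptions, not the dashboard's confirmed definition.

def paulscore(click_positions_per_search, f=0.7):
    """Average, over searches, of sum(f**k) for each clicked position k."""
    if not click_positions_per_search:
        return 0.0
    scores = [
        sum(f ** k for k in positions)
        for positions in click_positions_per_search
    ]
    return sum(scores) / len(scores)

# Searches whose clicks land high in the ranking score well...
engaged = paulscore([[1], [1, 2], [3]], f=0.5)

# ...but if click positions are never recorded (as with the autocomplete
# searches above), every per-search score is an empty sum and the
# metric degenerates to 0.
broken = paulscore([[], [], []], f=0.5)
```

The second call mirrors the reported symptom: a constant 0 usually means the inputs are empty, not that users never click.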

Anomalies in metrics (may need further investigation)

Event Timeline

Vvjjkkii renamed this task from Review current search metrics for accuracy and documentation to j3aaaaaaaa. Jul 1 2018, 1:04 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from j3aaaaaaaa to Review current search metrics for accuracy and documentation. Jul 2 2018, 1:52 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
chelsyx triaged this task as Medium priority.
chelsyx moved this task from Backlog to Doing on the Product-Analytics board.

Change 462032 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[wikimedia/discovery/golden@master] Fix bugs in survival analysis

https://gerrit.wikimedia.org/r/462032

Change 462606 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[wikimedia/discovery/golden@master] Fix paulScore for autocomplete searches

https://gerrit.wikimedia.org/r/462606

Change 462606 merged by Bearloga:
[wikimedia/discovery/golden@master] Fix paulScore for autocomplete searches

https://gerrit.wikimedia.org/r/462606

For the anomalies, if there isn't time (yet, or at all) to investigate them individually, would it be possible to create a sampling tool that would let us look for obvious skews in the usual usage stats?

I'm imagining a tool where you specify a day, probably a wiki, and possibly other parameters to narrow the scope, and you get back: a frequency list of the top 100 pages visited; a histogram of page visit times; histograms of search session length and number of queries, plus a list of the top 100 most extreme; a count of bot vs. non-bot searches; and the top 100 IPs issuing queries, user agents, queries, referrers, etc.
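The frequency-list half of such a tool is cheap to prototype. A minimal sketch, assuming request-log records arrive as dicts (the field names "page", "ip", and "query" and the per-day pre-filtered sample are hypothetical, not a real schema):

```python
from collections import Counter

# Hypothetical sketch of the frequency-list part of the proposed tool.
# Record fields ("page", "ip", "query") are made up for illustration;
# a real version would read a sampled day of logs for one wiki.

def top_n(records, field, n=100):
    """Frequency list of the n most common values of one field."""
    return Counter(r[field] for r in records if field in r).most_common(n)

sample = [
    {"page": "Main_Page", "ip": "203.0.113.9",  "query": "cats"},
    {"page": "Main_Page", "ip": "203.0.113.9",  "query": "cats"},
    {"page": "Cat",       "ip": "198.51.100.7", "query": "cat"},
]

pages = top_n(sample, "page")
ips = top_n(sample, "ip")
```

Running the same summaries for a "normal" day and an anomalous day, then diffing the two lists, is the comparison described below.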

Comparing two or three days during "normal" times right before an anomaly, and two or three days during the anomaly could reveal likely culprits. The page of a celebrity who recently died, and related pages, suddenly got a lot of traffic. The proportion of identified bots doubled. One IP address issued ten times as many queries as the next two or three combined. A single search session lasted the entire 24 hours and had 200,000 queries. Traffic from Reddit spiked. A specific query, or a bunch of related queries, jumped to the top of the list—the latter indicating that people are interested in a topic, or the former indicating that people may be following a link.

These kinds of stats don't give specific answers, but they do point to external causes. So at least we'd know that something happened during a given anomaly, even if we didn't have all the details.

This tool (or at least part of it) would have to be internal-only since IP addresses, user agents, and queries are all potential PII. It might also be fairly expensive to run, so we might not want to make it widely available.

OTOH, if building such a tool is more work than investigating 100 anomalies, then it probably isn't worth it in the short term.

Does anyone else think this would help, or should we just let mysteries be mysteries?

Change 462032 merged by Chelsyx:
[wikimedia/discovery/golden@master] Fix bugs in survival analysis

https://gerrit.wikimedia.org/r/462032

Change 463517 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[operations/puppet@production] Add chelsyx to analytics-search-users group

https://gerrit.wikimedia.org/r/463517

Change 463543 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[wikimedia/discovery/golden@master] Change SQL queries using MobileWikiAppSearch table to Hive queries

https://gerrit.wikimedia.org/r/463543

Change 463575 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[wikimedia/discovery/golden@master] Add magrittr package in sample_page_visit_ld.R

https://gerrit.wikimedia.org/r/463575

Change 463517 merged by Ottomata:
[operations/puppet@production] Add chelsyx to analytics-search-users group

https://gerrit.wikimedia.org/r/463517

Change 463543 merged by Chelsyx:
[wikimedia/discovery/golden@master] Change SQL queries using MobileWikiAppSearch table to Hive queries

https://gerrit.wikimedia.org/r/463543

Change 463575 merged by Chelsyx:
[wikimedia/discovery/golden@master] Add magrittr package in sample_page_visit_ld.R

https://gerrit.wikimedia.org/r/463575

Change 467866 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[wikimedia/discovery/golden@master] Refactor load_times.R to avoid using beeline to query

https://gerrit.wikimedia.org/r/467866

Change 467866 merged by Chelsyx:
[wikimedia/discovery/golden@master] Refactor load_times.R to avoid using beeline to query

https://gerrit.wikimedia.org/r/467866

Change 468089 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[wikimedia/discovery/rainbow@develop] Search dashboard audit

https://gerrit.wikimedia.org/r/468089

Change 468089 merged by Chelsyx:
[wikimedia/discovery/rainbow@develop] Search dashboard audit

https://gerrit.wikimedia.org/r/468089

All the changes have been merged. Please check https://discovery.wmflabs.org/metrics/ and let me know if there are any questions.


@TJones Thanks for the suggestions! The tool you suggested sounds helpful, but as you mentioned, it may be fairly expensive to run and would contain PII, which requires some form of authentication. Druid and Superset/Turnilo may be solutions for this kind of tool, but I'm still learning them and am not sure whether they can help with our problem.

Meanwhile, some of the dashboards we already have can help unveil the cause of some anomalies. For example, we saw direct usage of full-text search via the API increase starting 3/22, and we also saw the full-text ZRR including bots increase around the same time (it did not increase when we excluded bots on the dashboard). This suggests that the spike we saw in full-text search via the API is likely due to bot behavior.
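That with/without-bots comparison can be stated concretely. ZRR (zero results rate) is the share of searches that return no results; a minimal sketch, where each search is a `(num_results, is_bot)` pair and both the flag and the events are made up for illustration:

```python
# Minimal sketch of the ZRR comparison described above. The is_bot
# flag and the example events are hypothetical, not real data.

def zero_results_rate(searches, include_bots=True):
    """Fraction of searches returning zero results, optionally excluding bots."""
    kept = [s for s in searches if include_bots or not s[1]]
    if not kept:
        return 0.0
    return sum(1 for n, _ in kept if n == 0) / len(kept)

events = [(0, True), (0, True), (5, False), (0, False), (12, False)]

zrr_all = zero_results_rate(events)
zrr_human = zero_results_rate(events, include_bots=False)

# When zrr_all rises but zrr_human stays flat, bot traffic is the
# likely driver of the spike, as in the API example above.
```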

Additionally, most of the mysteries I've seen so far are the result of internal causes: bugs in our data retrieval scripts, changes in the Analytics Engineering team's parsing scripts (e.g. the weird pattern in morelike and prefix search since Apr 1st seems to be the result of direct referred traffic being re-categorized as internal traffic), or changes made by the front-end teams (mobile web, iOS/Android apps). If we can build a better communication channel with the teams that use our search services, so that they notify us when related changes occur and we keep a log of those changes, I think that would be more helpful in understanding the mysteries.

Sounds good, @chelsyx —thanks for the updates and all the fixes!

I think we can close this ticket. The general consensus seems to be that the "anomalies" are mostly not errors and are just unexplained variation in usage patterns, which we don't necessarily need to track down. (There's one that still bothers me, though.)

I've found one more concern on the dashboard pages, so I will open a new ticket covering the one outstanding anomaly/bug and this new possible bug.

Thanks so much for all the hard work, @chelsyx!