Hybrid (Semantic + Lexical) Search MVP on Android
Open, Needs Triage, Public

Description

Build and test a Hybrid Search MVP in the Wikipedia Android app that displays results from both lexical and semantic retrieval to improve how readers discover and navigate information, especially for intent-driven and natural-language queries, while preserving trust, performance, and transparency.

This epic delivers a single experimental search surface backed by dual retrieval configurations, enabling rapid A/B testing of result formats, snippet types, and provenance indicators. The goal is not feature parity or long-term rollout, but validated learning that informs whether hybrid search should scale, pivot, or be abandoned.

Problem Statement

Readers use on-wiki search heavily in the apps. However, reader patterns across platforms show abandonment in favor of external search engines when queries are intent-driven, exploratory, or phrased in natural language. Discovery of advanced search capabilities is low, and current keyword-based retrieval does not consistently surface the most relevant sections or context for these queries.

This creates friction in discovery, limits depth of engagement, and weakens retention—particularly among logged-out and casual readers.

Goal

Validate whether hybrid search (semantic + lexical) paired with focused UX patterns can:

  • Increase reader efficiency in finding relevant information
  • Encourage deeper exploration without forcing full-article reading
  • Improve short-term engagement and return behavior
  • Do so without harming trust, performance, or overall app retention

Gather data for evaluation and to inform an experience in which users do not have to choose the retrieval method best suited to their query type; instead, we would provide the most relevant results across retrieval methods.

Hypothesis

If we test hybrid search MVP variants that support keyword, natural-language, and meaning-based retrieval in the Android app, we can identify which retrieval and design patterns are most likely to increase total search engagement by ~10% (sessions initiated × average session length), without negative impact on satisfaction, latency, or retention.
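As a rough illustration of the engagement figure in the hypothesis (sessions initiated × average session length), here is a minimal sketch of how the ~10% target could be checked. The function names and all numbers are hypothetical and are not taken from the measurement plan.

```python
def total_engagement(sessions_initiated: int, avg_session_length_s: float) -> float:
    """Total search engagement = sessions initiated x average session length."""
    return sessions_initiated * avg_session_length_s


def relative_lift(control: float, treatment: float) -> float:
    """Relative change of treatment over control (0.10 means +10%)."""
    if control <= 0:
        raise ValueError("control engagement must be positive")
    return (treatment - control) / control


# Hypothetical numbers, for illustration only.
control = total_engagement(sessions_initiated=1000, avg_session_length_s=40.0)
treatment = total_engagement(sessions_initiated=1050, avg_session_length_s=42.0)
lift = relative_lift(control, treatment)
meets_target = lift >= 0.10  # the ~10% target from the hypothesis
```

Because the metric is a product, a 5% gain in sessions combined with a 5% gain in session length compounds to roughly a 10.25% lift, so the target can be reached through either factor or both.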

In Scope (MVP)

Platform & Audience
  • Wikipedia Android app
  • Logged-out readers (the feature will also be exposed to logged-in readers so we can explore their metrics)
  • EN, FR, PT Wiki
Search & Retrieval
  • Existing lexical search vs hybrid (semantic + lexical) retrieval
  • Entry points for search should remain the same
  • Support predictive queries using our existing systems
  • Encourage semantic queries through onboarding and other creative means (placeholder text in search)
  • Search can only be initiated on Go or Click
Result Presentation
  • Semantic results should show section or paragraph level snippets
  • Ideally (strongly preferred), clicking a semantic result takes the user to the correct section of the article, with highlights
  • Provenance indicators (article source and clear indicators that content comes from human-generated articles)
  • Ideally (strongly preferred), the layout has two variants to account for first-level bias
  • Explicit experiment labeling and opt-out ability
Experimentation & Instrumentation
  • Feature flagging / remote config for rapid iteration
  • Experiments last no longer than 14 days so we can consider two iterations
  • A/B testing with our existing experience as control, and a possible C variant for alternative UI experiences
  • Full query, result, click, and satisfaction data should be logged
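One way to picture the A/B(/C) split with opt-out is deterministic bucketing on a stable identifier. The sketch below is an assumption for illustration only: the variant names, the `install_id` parameter, and the hashing scheme are hypothetical and do not describe the app's actual feature-flag or remote-config system.

```python
import hashlib

# Hypothetical variant names: existing lexical search plus two hybrid layouts.
VARIANTS = ("control", "hybrid_b", "hybrid_c")


def assign_variant(install_id: str, experiment: str, opted_out: bool = False) -> str:
    """Deterministically map an install to a variant; opted-out users get control."""
    if opted_out:
        return "control"
    # Salting with the experiment name keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{install_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(VARIANTS)
    return VARIANTS[bucket]
```

Hashing rather than random assignment means a given install always sees the same variant for the duration of an experiment, without storing any assignment state.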

Full measurement plan available here

Out of Scope

  • Generative Q&A or answer synthesis
  • Query prediction or auto-generation
  • Cross-wiki or cross-language retrieval
  • Personalization or user profile ranking
  • Summarization
  • Long-term UX polish or feature hardening

Success Metrics & Guardrails

Primary Success Signals
  • Increase reader engagement with on-wiki search: achieve an X% increase in the number of search sessions initiated per unique user compared to the lexical baseline.
  • Increase depth of engagement: achieve an X% increase in average session length (from first query to last interaction).
    • Supporting indicator: track queries per search session, segmented into:
      • Exploration queries: queries that occur after a successful click and normal dwell time (healthy curiosity / rabbitholing).
      • Reformulation queries: near-identical or quickly repeated queries after no click or a short dwell (friction).
  • Increase efficiency of discovering content: reduce median time-to-click by X% compared to lexical baselines.
    • Supporting indicators: lower reformulation rate and higher click-through on top-ranked results.
    • Track good abandonment separately as searches with no reformulation or exit within the observation window (indicating the user likely found what they needed in the snippet or preview).
  • Increase perceived relevance and satisfaction: at least 80% of users agree that search results are relevant and satisfactory.
  • Increase retention of logged-out users: raise the seven-day search return rate for logged-out readers by 5% versus baseline. (KR alignment)
  • Generate validated learning data: capture at least 90% of MVP query samples with full logs and annotations to feed the shared evaluation dataset.

Note: The observation window for good abandonment will be set using baseline behavior.
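The exploration vs. reformulation segmentation described above could be sketched as follows. All thresholds (minimum dwell, quick-repeat window, similarity cutoff) and the `QueryEvent` structure are placeholder assumptions; the real values would come from baseline behavior and the measurement plan.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Placeholder thresholds; assumptions, not values from the measurement plan.
MIN_DWELL_S = 10.0      # "normal" dwell time after a click
QUICK_REPEAT_S = 15.0   # a follow-up this soon counts as "quickly repeated"
NEAR_IDENTICAL = 0.8    # similarity ratio treated as "near-identical"


@dataclass
class QueryEvent:
    text: str
    timestamp_s: float
    clicked: bool = False
    dwell_s: float = 0.0


def classify_followup(prev: QueryEvent, nxt: QueryEvent) -> str:
    """Label the query that follows `prev` within the same search session."""
    if prev.clicked and prev.dwell_s >= MIN_DWELL_S:
        return "exploration"  # successful click + normal dwell
    # From here on, prev had no click or only a short dwell.
    similarity = SequenceMatcher(None, prev.text.lower(), nxt.text.lower()).ratio()
    quick = (nxt.timestamp_s - prev.timestamp_s) <= QUICK_REPEAT_S
    if similarity >= NEAR_IDENTICAL or quick:
        return "reformulation"  # friction
    return "other"
```

Follow-up queries that match neither definition are labeled "other" here rather than forced into one of the two segments; whether such a bucket is wanted is a design choice for the analysis.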

Qualitative Signals
  • ≥ 80% perceived relevance & satisfaction
  • ≥ 85% correct identification of content provenance
  • ≤ 5% negative trust feedback

We will partner with Design Research to understand the qualitative signals.

Guardrails
  • Median search latency does not increase by more than 15% vs baseline
  • No statistically significant drop in overall app retention
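The latency guardrail amounts to a simple comparison of medians. A minimal sketch, assuming latency samples are available as lists of milliseconds (the function name and sample data are hypothetical):

```python
from statistics import median

MAX_MEDIAN_INCREASE = 0.15  # guardrail: median latency may not rise more than 15%


def latency_guardrail_ok(baseline_ms: list[float], variant_ms: list[float]) -> bool:
    """Return True if the variant's median latency stays within the guardrail."""
    base = median(baseline_ms)
    if base <= 0:
        raise ValueError("baseline median latency must be positive")
    increase = (median(variant_ms) - base) / base
    return increase <= MAX_MEDIAN_INCREASE
```

The retention guardrail is different in kind: it calls for a statistical significance test on retention rates, which this sketch does not attempt.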

Deliverables

  • Hybrid search endpoint integrated into Android app
  • Configurable search UI supporting multiple experiment variants
  • Instrumentation sufficient to capture ≥90% of MVP queries with annotations
  • Experiment readout summarizing engagement, trust, performance, and retention
  • Community consultation on results
  • Recommendation to scale, pivot, or stop based on evidence

Ideally, we commence work in Q3 of FY25-26, but this depends on feedback from the community.

Related Objects

Status    Subtype   Assigned      Task
Open                None
Open                SNowick_WMF
Resolved            TLessa-WMF
Open                WRai-WMF
Open                cooltey
Open                WRai-WMF
Open                cooltey
Open                cooltey
Open                cooltey
Open                cooltey
Open                WRai-WMF
Open                Dbrant
Open                cooltey
Open                WRai-WMF
Open                WRai-WMF
Open                None

Event Timeline

Translation sheet here (translations TBD): https://docs.google.com/spreadsheets/d/1d7GfFyeftBPE_I7fH74kjrE_3owJrJQgfKSds1KPxHw/edit?usp=sharing

Reminder: We will need the copy explaining this experiment on the MediaWiki page so that we can unblock translations to PT and FR. Flow showing the paths that link to the page here: https://www.figma.com/design/JlBBR9rVwHyZlUvl5BpKpL/Android---%3E-Semantic-Search-MVP?node-id=7410-49093&t=xuX5FORTa3yuSNQP-1

I had to update the translation sheet and put in a new request since we now need it translated into Greek. The new sheet has the latest design updates: https://docs.google.com/spreadsheets/d/1d7GfFyeftBPE_I7fH74kjrE_3owJrJQgfKSds1KPxHw/edit?usp=sharing

Next step: send to Bethany for translation - we chatted, and she will support us on that.