
Collect a set of representative queries for the benchmark dataset
Closed, Resolved (Public)

Description

The goal of this task is to collect a set of queries for the benchmark dataset.
These should be representative of the different query types identified in T407603.
The queries will later be used for annotation.

Potential resources:

Event Timeline

  • Finalized the logic for query selection based on the search logs (document describing the full logic). Added a filter that excludes queries matching page titles (navigational queries).
  • Collected an initial set of queries for the pilot experiment (notebook with query selection logic).
  • Performed a manual evaluation of a small random subset of queries to confirm selection quality (only ~1-4% of queries needed to be filtered out manually).
  • Refactored the query selection logic to enable execution of the full pipeline in PySpark, allowing processing of the complete 90-day log corpus. Applied a minimum threshold of 25 identities per query.
  • Conducted a manual review of the collected queries to assess the presence of PII; none was identified after applying the 25-identity threshold. The review also informed potential improvements to the selection logic.
  • Following the initial review, implemented additional post-processing steps, including near-duplicate removal using Levenshtein distance (e.g., <film name> episode 2 vs. <film name> episode 20) and automated filtering of patterns associated with automated parsing of non-notable companies. While this step removes the majority of such cases, some manual post-filtering may still be required.
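
For illustration, the selection steps described above (minimum identity threshold, navigational-query filter, Levenshtein-based near-duplicate removal) could be sketched roughly as follows. This is a plain-Python sketch, not the PySpark pipeline or the linked notebook. The function names, the `(query, identity)` input shape, the lowercase normalization, the distance cutoff of 1, and the rule of keeping the more frequent query of a near-duplicate pair are all assumptions made for this example.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def select_queries(log_rows, page_titles, min_identities=25, max_distance=1):
    """Hypothetical helper: log_rows is an iterable of (query, identity) pairs."""
    # Count distinct identities per normalized query.
    identities = defaultdict(set)
    for query, identity in log_rows:
        identities[query.strip().lower()].add(identity)

    # Drop rare queries (threshold from the task) and queries that match
    # a page title exactly (treated here as navigational).
    titles = {t.strip().lower() for t in page_titles}
    candidates = [q for q, ids in identities.items()
                  if len(ids) >= min_identities and q not in titles]

    # Assumption: visit the most frequent queries first, so the more popular
    # member of a near-duplicate pair is the one that is kept.
    candidates.sort(key=lambda q: (-len(identities[q]), q))
    kept = []
    for q in candidates:
        if all(levenshtein(q, k) > max_distance for k in kept):
            kept.append(q)
    return kept
```

The pairwise comparison in the last loop is quadratic in the number of candidates, so on the full 90-day corpus it would likely need blocking or a distributed equivalent; the sketch only shows the logic on a small in-memory sample.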

final update: