Create tool and process to investigate Search update Pipeline failures
Closed, Resolved | Public | 8 Estimated Story Points

Description

See parent task for context. The goal of this task is to have the appropriate tooling and processes to make investigating data pipeline issues not too tedious.

A minimal implementation could simply dump a JSON list of related events based on pageID / request ID / ...

A more complex implementation might add Superset dashboards and nicer interfaces.

Event Timeline

Gehel triaged this task as High priority. Feb 5 2024, 4:32 PM
Gehel set the point value for this task to 8.
Gehel updated the task description.

The idea is something like:

  • User provides one or more pieces of context they would like to investigate:
    • time range
    • wiki id + page id
    • wiki id + page namespace + title
    • debug request-id
  • System collects all related events from the topics SUP reads and writes to and displays them to the user
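The matching step above could be sketched roughly as follows. This is only an illustration: the `Context` class and the event field names (`dt`, `wiki_id`, `page_id`, `namespace_id`, `page_title`, `request_id`) are hypothetical and may not match the real SUP event schemas, and reading events from the actual topics is left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Context:
    """User-supplied criteria; every field is optional (hypothetical shape)."""
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    wiki_id: Optional[str] = None
    page_id: Optional[int] = None
    namespace_id: Optional[int] = None
    page_title: Optional[str] = None
    request_id: Optional[str] = None


def matches(event: dict, ctx: Context) -> bool:
    """True if the event is inside the time range and matches a given identifier."""
    # Assumes "dt" is a naive ISO 8601 timestamp such as "2024-02-05T16:32:00".
    ts = datetime.fromisoformat(event["dt"])
    if ctx.start is not None and ts < ctx.start:
        return False
    if ctx.end is not None and ts >= ctx.end:
        return False

    by_request = ctx.request_id is not None and event.get("request_id") == ctx.request_id
    same_wiki = ctx.wiki_id is not None and event.get("wiki_id") == ctx.wiki_id
    by_page_id = same_wiki and ctx.page_id is not None and event.get("page_id") == ctx.page_id
    by_title = (
        same_wiki
        and ctx.page_title is not None
        and event.get("namespace_id") == ctx.namespace_id
        and event.get("page_title") == ctx.page_title
    )

    # With only a time range given, every event in the range is returned.
    has_identifier = any(
        v is not None for v in (ctx.request_id, ctx.page_id, ctx.page_title)
    )
    return (by_request or by_page_id or by_title) if has_identifier else True


def collect(events: Iterable[dict], ctx: Context) -> List[dict]:
    """Keep the events, from any topic, that match the context."""
    return [e for e in events if matches(e, ctx)]
```

Identifiers are combined with OR so that a single query can follow a page across topics that key events differently, while the time range always narrows the result.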

Current process (to be refined). None of this is committed anywhere yet; this is mostly working out what is going to work.

  1. A series of scripts that run on an mwlog instance and:
    • parse CirrusSearch.log to look for messages from LogOnlyRemediator (saneitizer checking commonswiki)
    • check if the error still exists on cloudelastic, and if it exists on eqiad
    • report only errors that currently exist in cloudelastic and do not exist in eqiad
  2. A Jupyter notebook
    • Accepts a few properties to look for, such as (domain, page_title), and a time range
    • Collects matching events from the producer output stream, and uses the request IDs found there to expand the list of properties to look for
    • Queries all input topics to find events matching any known request ID, (domain, page_id) pair or (domain, page_title) pair
    • Reports them out, and writes a JSON file containing the events
  3. A JUnit test case
    • Reads in the events from the Jupyter notebook output
    • Runs them through the DeduplicateAndMerge step and reports the result.
    • With any luck this recreates the incorrect events from above, and allows writing a fix.
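
For illustration, the notebook's expand-then-query step (item 2 above) might look roughly like the sketch below. It is not the actual notebook code: the helpers `keys_of`, `expand` and `to_json` and the event field names are hypothetical, and events are passed in as plain lists of dicts rather than read from the real topics.

```python
import json
from typing import List, Set, Tuple


def keys_of(event: dict) -> Set[Tuple]:
    """Identifiers an event can be matched on (hypothetical field names)."""
    keys: Set[Tuple] = set()
    if event.get("request_id"):
        keys.add(("request_id", event["request_id"]))
    if event.get("domain") and event.get("page_id") is not None:
        keys.add(("page_id", event["domain"], event["page_id"]))
    if event.get("domain") and event.get("page_title"):
        keys.add(("page_title", event["domain"], event["page_title"]))
    return keys


def expand(seed_keys: Set[Tuple], output_events: List[dict],
           input_events: List[dict]) -> List[dict]:
    """Grow the seed with request IDs from the output stream, then query inputs."""
    known = set(seed_keys)
    # Pass 1: output events that match a known key contribute their request ID.
    for event in output_events:
        event_keys = keys_of(event)
        if event_keys & known:
            known |= {k for k in event_keys if k[0] == "request_id"}
    # Pass 2: keep every input event that shares at least one known key.
    return [e for e in input_events if keys_of(e) & known]


def to_json(events: List[dict]) -> str:
    """Serialize the collected events so a later test case can replay them."""
    return json.dumps(events, indent=2, sort_keys=True)
```

Writing the result of `to_json` to a file would give the JUnit test case in item 3 a fixed input to replay through DeduplicateAndMerge.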