Mjolnir feature collection failing in mjolnir_weekly Airflow DAG
Closed, Resolved | Public | 5 Estimated Story Points

Description

Mjolnir has been failing feature collection for several weeks in a row now. The most recent run finishes with:

Exception: Did not collect equal number of rows per feature

Figure out what happened to feature collection and get everything running again.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
search: query_clicks: Parameterize max_active_runs for backfills | repos/data-engineering/airflow-dags!2078 | ebernhardson | work/ebernhardson/query-clicks-parameterize | main
search: query-clicks: Repair null timestamps | repos/data-engineering/airflow-dags!2075 | ebernhardson | work/ebernhardson/query-clicks-null-timestamp | main

Event Timeline

pfischer set the point value for this task to 5. (Jan 26 2026, 4:36 PM)
dr0ptp4kt renamed this task from "Mjolnir feature collection failing" to "Mjolnir feature collection failing in mjolnir_weekly Airflow DAG". (Mar 4 2026, 4:28 PM)

Had a bit of time to start looking into this. Some findings:

  • Feature collection fails because the input queries are empty.
  • The input queries are empty because query_clicks_ltr filters by session count, but all the session_id values are null.
  • The first partition that is missing session_id values is discovery.query_clicks_daily/year=2025/month=8/day=28.
  • Those partitions are generated by an HQL query. Neither the query nor the Airflow DAG that issues it has been changed since May 2025.

Not sure yet what changed, but we have a direction to look in. It is not clear what we can do about all the missing session_id values at this point; they likely can't be reconstructed, as we don't retain enough data.
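
To illustrate the failure chain in the findings above, here is a minimal Python sketch of how a session-count filter ends up with no rows when every session_id is null. The real filter lives in query_clicks_ltr and is not shown in this task, so the function name, the row layout, and the threshold are all hypothetical, as is the assumption that nulls are skipped the way a distinct count in SQL skips them.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


def filter_by_session_count(rows: List[Row], min_sessions: int) -> List[Row]:
    """Keep rows whose query was seen in at least `min_sessions` distinct sessions.

    Hypothetical stand-in for the session-count filter in query_clicks_ltr.
    Null session ids are skipped, mirroring how COUNT(DISTINCT ...) ignores NULLs.
    """
    sessions_per_query: Dict[str, Set[str]] = {}
    for row in rows:
        session_id = row["session_id"]
        if session_id is None:
            continue  # a null session id never contributes to the count
        sessions_per_query.setdefault(row["query"], set()).add(session_id)

    keep = {q for q, s in sessions_per_query.items() if len(s) >= min_sessions}
    return [row for row in rows if row["query"] in keep]


healthy = [
    {"query": "foo", "session_id": "a"},
    {"query": "foo", "session_id": "b"},
]
broken = [
    {"query": "foo", "session_id": None},
    {"query": "foo", "session_id": None},
]

healthy_kept = filter_by_session_count(healthy, min_sessions=2)
broken_kept = filter_by_session_count(broken, min_sessions=2)
```

With the healthy rows the query survives the filter; with the all-null rows nothing does, which would leave feature collection with empty input.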

The problem was traced to null timestamps coming out of query_clicks_hourly. The timestamp conversion used an overly specific format specifier, and the source data started including millisecond precision in the timestamp, so the conversion no longer matched. The conversion was changed to a more permissive one. The last three months of query_clicks_hourly and query_clicks_daily were backfilled. The Mjolnir DAG was unpaused and completed a run.
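
The actual fix is in the HQL for query_clicks_hourly (airflow-dags!2075) and is not reproduced here. As a rough Python sketch of the same class of bug, the strict parser below accepts only second-precision timestamps and returns None once milliseconds appear, while the permissive one accepts both. The format strings and sample values are assumptions chosen for illustration, not the ones used in the real query.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed formats for illustration only; the real HQL format is not shown in this task.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
PERMISSIVE_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def parse_strict(ts: str) -> Optional[datetime]:
    """Parse with one exact format; any deviation yields None (like a SQL NULL)."""
    try:
        return datetime.strptime(ts, STRICT_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def parse_permissive(ts: str) -> Optional[datetime]:
    """Try the fractional-second format first, then fall back to whole seconds."""
    for fmt in PERMISSIVE_FORMATS:
        try:
            return datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


old_style = "2025-08-27T12:34:56Z"      # second precision
new_style = "2025-08-28T12:34:56.789Z"  # millisecond precision added upstream

strict_old = parse_strict(old_style)
strict_new = parse_strict(new_style)
permissive_old = parse_permissive(old_style)
permissive_new = parse_permissive(new_style)
```

Here strict_new is None while permissive_new keeps the fractional seconds, which is the difference between silently producing null timestamps and tolerating the upstream precision change.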

I was curious about changes to the training loss, so I put together a graph of the final NDCG achieved for each weekly run since 2018. It shows that training is mostly back to historical norms.

@EBernhardson / @pfischer: The only associated project tag has been archived. If this task is done, please change its status to resolved; if it is not done, please associate an active project tag. Thanks!

EBernhardson claimed this task.