
Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest
Closed, ResolvedPublic3 Estimated Story Points

Description

Airflow sends a failure alert for the DAG refine_webrequest_hourly_text, task refine_webrequest, concerning a specific DAG run (1 hour of data).
The logs say:

java.lang.RuntimeException: Duplicate map key preview was found, please check the input data.

The problem happens when the job tries to read wmf_raw.webrequest and parse the x_analytics field (string) as a map with str_to_map(x_analytics, '\;', '=').
The x_analytics string values that cause the error have the same key more than once, like:

preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1;client_port=1588

In this case, it was preview. It also happened with the key pageview.
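The failure mode can be reproduced outside Spark with a small plain-Python sketch (hypothetical helper name, not refinery code): a strict parse mirrors Spark's default mapKeyDedupPolicy=EXCEPTION and fails on the duplicate key, while a last-wins parse mirrors what str_to_map does under LAST_WIN.

```python
def parse_x_analytics(header: str, strict: bool = True) -> dict:
    """Parse an x_analytics header ("k=v;k=v;...") into a dict.

    strict=True mirrors Spark's default mapKeyDedupPolicy=EXCEPTION:
    a duplicate key raises, like the 'Duplicate map key' RuntimeException.
    strict=False mirrors LAST_WIN: the last value for each key is kept.
    """
    result = {}
    for pair in header.split(";"):
        key, _, value = pair.partition("=")
        if strict and key in result:
            raise ValueError(f"Duplicate map key {key} was found")
        result[key] = value  # dict assignment is naturally last-wins
    return result

header = "preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1;client_port=1588"

try:
    parse_x_analytics(header)  # raises: 'preview' appears twice
except ValueError as e:
    print(e)

print(parse_x_analytics(header, strict=False)["preview"])  # '1'
```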
So far, the number of affected rows in a wmf_raw.webrequest hourly partition has been between 1 and 2.
Over the last couple of months we've seen this issue several times; in the last week alone it happened 6 times.

We can manually re-run the job by filtering out the corrupted rows with the excluded_row_ids mechanism.
But it is a manual task that takes time from us and delays all the webrequest/pageview-dependent pipelines.
Thus, we should fix this permanently.


A possible solution is to set the Spark option spark.sql.mapKeyDedupPolicy=LAST_WIN in the code, so that duplicate keys do not break the computation.
This would make the last duplicated key-value pair the effective one when parsing the x_analytics string into a map.

Pro: it's the simplest solution, a one-liner.
Con: it could hide issues with the x_analytics header (and with other str_to_map operations in the query, e.g. the TLS field).
Compromise: if we choose to go this way, we could add checks for duplicates in an external data quality check that we can follow up with.
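To make the Con concrete, here is a plain-Python emulation (hypothetical helper name, illustrative header values) of str_to_map under LAST_WIN: duplicates with the same value are harmless, but when the values differ the earlier one is silently discarded, which is exactly what a follow-up quality check would need to surface.

```python
def str_to_map_last_win(header: str) -> dict:
    """Emulate str_to_map(header, '\\;', '=') under mapKeyDedupPolicy=LAST_WIN:
    later key-value pairs overwrite earlier ones instead of raising."""
    return dict(pair.partition("=")[::2] for pair in header.split(";"))

# Same value repeated: LAST_WIN loses nothing.
assert str_to_map_last_win("preview=1;ns=0;preview=1") == {"preview": "1", "ns": "0"}

# Distinct values: the first 'ns' is silently dropped -- the case worth alerting on.
print(str_to_map_last_win("ns=0;preview=1;ns=6"))  # {'ns': '6', 'preview': '1'}
```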

Event Timeline

Just had a chat with @JAllemandou; this could be a good use case for T349763: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics.

Compromise: if we choose to go this way, we could add checks for duplicates in an external data quality check that we can follow up with.

This case is interesting because it stretches some of the assumptions w.r.t. the data validation library we are prototyping. With some regexp-foo we might be able to implement validation with a Check.satisfies constraint, and possibly run a CustomSql analyzer (not sure how useful).

@mforns what kind of information would you need to help troubleshoot? Would knowing that a partition contains duplicate records be sufficient as a trigger for further investigation?

@mforns what kind of information would you need to help troubleshoot? Would knowing that a partition contains duplicate records be sufficient as a trigger for further investigation?

Yes, I think knowing whether the partition contains duplicates is good. Maybe having the number of rows with duplicates would also be useful, since it lets us know whether the issue is serious or can be ignored.

Discussed during Data Engineering standup: let's fix with spark.sql.mapKeyDedupPolicy=LAST_WIN, and implement the metric to monitor how often it happens :)
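A sketch of that metric in plain Python (hypothetical function names; the real implementation would run as a Spark aggregation over the hourly partition): count how many rows carry a duplicate x_analytics key.

```python
def has_duplicate_keys(x_analytics: str) -> bool:
    """True if any key appears more than once in the x_analytics header."""
    keys = [pair.partition("=")[0] for pair in x_analytics.split(";")]
    return len(keys) != len(set(keys))

def count_affected_rows(partition_rows: list[str]) -> int:
    """Number of rows in a partition whose x_analytics has duplicate keys."""
    return sum(1 for row in partition_rows if has_duplicate_keys(row))

rows = [
    "ns=0;page_id=46350;https=1",
    "preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1",  # duplicate
    "pageview=1;ns=0",
]
print(count_affected_rows(rows))  # 1
```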

Change 977774 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Set spark.sql.mapKeyDedupPolicy to LAST_WIN in refine_webrequest

https://gerrit.wikimedia.org/r/977774

Change 977774 merged by Milimetric:

[analytics/refinery@master] Set spark.sql.mapKeyDedupPolicy to LAST_WIN in refine_webrequest

https://gerrit.wikimedia.org/r/977774

Milimetric subscribed.

Merged and deployed just now; used it to fix another instance of the webrequest duplicate-map-key failures. Note for future selves: it would still be good to figure out where these are coming from.

… and implement the metric monitoring how often it happens :)

… Note for future selves: it would still be good to figure out where these are coming from.

Is it possible to have the monitoring log some information about the rows such that we can figure out where they're coming from?

Is it possible to have the monitoring log some information about the rows such that we can figure out where they're coming from?

Not in any easy way that I know of, because we use spark-sql and it's hard to output to multiple sinks at the same time, let alone logs. With pyspark or similar we would have the dataframe: we could analyze it first, output exceptions like this into one place, then process the rest.

@gmodena is working on adding data-quality metrics on the webrequest dataset (https://phabricator.wikimedia.org/T349763), and we added this one (duplicate map keys) to the list of things to include in the POC.
We should have some data one of these days :)

If the keys and values are the same, LAST_WIN is fine. If the values differ, there is likely a problem, and we should alert on it.

In either case, I think the right behavior is LAST_WIN, with alerting on duplicate keys that have different values.

Or...maybe just document that last always wins and call it a day?

Looks like there were duplicate x_analytics keys in December data. The following wmf.webrequest partitions have been flagged as buggy (detected during an integration test):

year=2023/month=12/day=24/hour=6
year=2023/month=12/day=22/hour=4
year=2023/month=12/day=6/hour=11

The following wmf.webrequest partitions have been flagged as buggy (detected during an integration test):

FWIW: this check accounts for any x_analytics entry with duplicate keys, regardless of their value. I bounced some ideas around with @JAllemandou earlier today, and in addition to updating the check semantics (to account for duplicate keys with distinct values) we want to track the following cases:

  1. alert (warn level) on duplicate keys regardless of value.
  2. alert (error level) on duplicate keys with distinct values (the cases where LAST_WIN silently drops data).

The reason to track both cases is to get insight into possible quality degradation of x_analytics entries over time, while only taking action on the cases where LAST_WIN loses data.
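The two-level check could look like the following plain-Python sketch (hypothetical names; the real check would live in the data-quality instrumentation): duplicate keys whose values all agree only warn, while duplicate keys with distinct values, where LAST_WIN discards information, raise an error.

```python
from collections import defaultdict

def duplicate_key_severity(x_analytics: str) -> str:
    """Classify an x_analytics header as 'ok', 'warn', or 'error'.

    warn  -> duplicate keys, but every duplicate carries the same value
    error -> duplicate keys with distinct values (LAST_WIN drops data)
    """
    values = defaultdict(set)
    keys = []
    for pair in x_analytics.split(";"):
        key, _, value = pair.partition("=")
        keys.append(key)
        values[key].add(value)
    if len(keys) == len(set(keys)):
        return "ok"          # no duplicates at all
    if any(len(v) > 1 for v in values.values()):
        return "error"       # duplicates disagree: LAST_WIN loses information
    return "warn"            # duplicates agree: harmless, but worth tracking

print(duplicate_key_severity("ns=0;preview=1"))            # ok
print(duplicate_key_severity("preview=1;ns=0;preview=1"))  # warn
print(duplicate_key_severity("preview=1;ns=0;preview=0"))  # error
```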

Change 988985 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery/source@master] refinery: log data quality alert severity

https://gerrit.wikimedia.org/r/988985

Change 988985 merged by Gmodena:

[analytics/refinery/source@master] refinery: log data quality alert severity

https://gerrit.wikimedia.org/r/988985