
Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest
Closed, ResolvedPublic3 Estimated Story Points

Description

Airflow sends a failure alert for the DAG refine_webrequest_hourly_text, task refine_webrequest, concerning a specific DAG run (1 hour of data).
The logs say:

java.lang.RuntimeException: Duplicate map key preview was found, please check the input data.

The problem happens when the job tries to read wmf_raw.webrequest and parse the x_analytics field (string) as a map with str_to_map(x_analytics, '\;', '=').
The x_analytics string values that cause the error have the same key more than once, like:

preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1;client_port=1588

In this case, it was preview. It also happened with the key pageview.
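The failure mode can be reproduced outside Spark with a small plain-Python sketch (hypothetical helper name, not refinery code): a strict parse mirrors Spark's default mapKeyDedupPolicy=EXCEPTION and fails on the duplicate key, while a last-wins parse mirrors what str_to_map does under LAST_WIN.

```python
def parse_x_analytics(header: str, strict: bool = True) -> dict:
    """Parse an x_analytics header ("k=v;k=v;...") into a dict.

    strict=True mirrors Spark's default mapKeyDedupPolicy=EXCEPTION:
    a duplicate key raises, like the 'Duplicate map key' RuntimeException.
    strict=False mirrors LAST_WIN: the last value for each key is kept.
    """
    result = {}
    for pair in header.split(";"):
        key, _, value = pair.partition("=")
        if strict and key in result:
            raise ValueError(f"Duplicate map key {key} was found")
        result[key] = value  # dict assignment is naturally last-wins
    return result

header = "preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1;client_port=1588"

try:
    parse_x_analytics(header)  # raises: 'preview' appears twice
except ValueError as e:
    print(e)

print(parse_x_analytics(header, strict=False)["preview"])  # '1'
```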
So far, the number of affected rows in a wmf_raw.webrequest hourly partition has been between 1 and 2.
Over the last couple of months we've seen this issue several times; in the last week alone it happened 6 times.

We can manually re-run the job by filtering out the corrupted rows with the excluded_row_ids mechanism.
But it is a manual task that takes time from us and delays all the webrequest/pageview-dependent pipelines.
Thus, we should fix this permanently.


A possible solution is to set the Spark option spark.sql.mapKeyDedupPolicy=LAST_WIN in the code, so that duplicate keys do not break the computation.
This would make the last duplicated key-value pair the effective one when parsing the x_analytics string into a map.

Pro: it's the simplest solution, a one-liner.
Con: it could hide issues with the x_analytics header (and with other str_to_map operations in the query, e.g. the TLS field).
Compromise: if we choose to go this way, we could add checks for duplicates in an external data quality check that we can follow up with.
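To make the Con concrete, here is a plain-Python emulation (hypothetical helper name, illustrative header values) of str_to_map under LAST_WIN: duplicates with the same value are harmless, but when the values differ the earlier one is silently discarded, which is exactly what a follow-up quality check would need to surface.

```python
def str_to_map_last_win(header: str) -> dict:
    """Emulate str_to_map(header, '\\;', '=') under mapKeyDedupPolicy=LAST_WIN:
    later key-value pairs overwrite earlier ones instead of raising."""
    return dict(pair.partition("=")[::2] for pair in header.split(";"))

# Same value repeated: LAST_WIN loses nothing.
assert str_to_map_last_win("preview=1;ns=0;preview=1") == {"preview": "1", "ns": "0"}

# Distinct values: the first 'ns' is silently dropped -- the case worth alerting on.
print(str_to_map_last_win("ns=0;preview=1;ns=6"))  # {'ns': '6', 'preview': '1'}
```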

Event Timeline

Just had a chat with @JAllemandou; this could be a good use case for T349763: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics.

Compromise: if we choose to go this way, we could add checks for duplicates in an external data quality check that we can follow up with.

This case is interesting because it stretches some of the assumptions w.r.t. the data validation library we are prototyping. With some regexp-foo we might be able to implement validation with a Check.satisfies constraint, and possibly run a CustomSql analyzer (not sure how useful).

@mforns what kind of information would you need to help troubleshoot? Would knowing that a partition contains duplicate records be sufficient as a trigger for further investigation?

@mforns what kind of information would you need to help troubleshoot? Would knowing that a partition contains duplicate records be sufficient as a trigger for further investigation?

Yes, I think knowing whether the partition contains duplicates is good. Maybe having the number of rows with duplicates would also be useful, since it lets us know whether the issue is serious or can be ignored.

Discussed during Data Engineering standup: let's fix with spark.sql.mapKeyDedupPolicy=LAST_WIN, and implement the metric to monitor how often it happens :)
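A sketch of that metric in plain Python (hypothetical function names; the real implementation would run as a Spark aggregation over the hourly partition): count how many rows carry a duplicate x_analytics key.

```python
def has_duplicate_keys(x_analytics: str) -> bool:
    """True if any key appears more than once in the x_analytics header."""
    keys = [pair.partition("=")[0] for pair in x_analytics.split(";")]
    return len(keys) != len(set(keys))

def count_affected_rows(partition_rows: list[str]) -> int:
    """Number of rows in a partition whose x_analytics has duplicate keys."""
    return sum(1 for row in partition_rows if has_duplicate_keys(row))

rows = [
    "ns=0;page_id=46350;https=1",
    "preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1",  # duplicate
    "pageview=1;ns=0",
]
print(count_affected_rows(rows))  # 1
```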

Change 977774 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Set spark.sql.mapKeyDedupPolicy to LAST_WIN in refine_webrequest

https://gerrit.wikimedia.org/r/977774

Change 977774 merged by Milimetric:

[analytics/refinery@master] Set spark.sql.mapKeyDedupPolicy to LAST_WIN in refine_webrequest

https://gerrit.wikimedia.org/r/977774

Milimetric subscribed.

Merged and deployed just now; used it to fix another instance of the webrequest duplicate-map-key failures. Note for future selves: it would still be good to figure out where these are coming from.

… and implement the metric monitoring how often it happens :)

… Note for future selves: it would still be good to figure out where these are coming from.

Is it possible to have the monitoring log some information about the rows such that we can figure out where they're coming from?

Is it possible to have the monitoring log some information about the rows such that we can figure out where they're coming from?

Not in any easy way that I know of, because we use spark-sql and it's hard to output to multiple sinks at the same time, let alone logs. With pyspark or similar we would have the dataframe: we could analyze it first, output exceptions like this into one place, then process the rest.

@gmodena is working on adding data-quality metrics on the webrequest dataset (https://phabricator.wikimedia.org/T349763), and we added this one (duplicate map keys) to the list of things to include in the POC.
We should have some data one of these days :)

If the keys and values are the same, LAST_WIN is fine. If the values differ, there is likely a problem, and we should alert on it.

In either case, I think the right behavior is LAST_WIN, with alerting on duplicate keys that have different values.

Or...maybe just document that last always wins and call it a day?

Looks like there were duplicate x_analytics keys in December data. The following wmf.webrequest partitions have been flagged as buggy (detected during an integration test):

year=2023/month=12/day=24/hour=6
year=2023/month=12/day=22/hour=4
year=2023/month=12/day=6/hour=11

The following wmf.webrequest partitions have been flagged as buggy (detected during an integration test):

FWIW: this check accounts for any x_analytics entry with duplicate keys, regardless of their value. I bounced some ideas around with @JAllemandou earlier today, and in addition to updating the check semantics (to account for duplicate keys with distinct values) we want to track the following cases:

  1. alert (warn level) on duplicate keys regardless of value.
  2. alert (error level) on duplicate keys with distinct values (the cases where LAST_WIN silently drops data).

The reason to track both cases is to get insight into possible quality degradation of x_analytics entries over time, while only taking action on the cases where LAST_WIN loses data.
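The two-level check could look like the following plain-Python sketch (hypothetical names; the real check would live in the data-quality instrumentation): duplicate keys whose values all agree only warn, while duplicate keys with distinct values, where LAST_WIN discards information, raise an error.

```python
from collections import defaultdict

def duplicate_key_severity(x_analytics: str) -> str:
    """Classify an x_analytics header as 'ok', 'warn', or 'error'.

    warn  -> duplicate keys, but every duplicate carries the same value
    error -> duplicate keys with distinct values (LAST_WIN drops data)
    """
    values = defaultdict(set)
    keys = []
    for pair in x_analytics.split(";"):
        key, _, value = pair.partition("=")
        keys.append(key)
        values[key].add(value)
    if len(keys) == len(set(keys)):
        return "ok"          # no duplicates at all
    if any(len(v) > 1 for v in values.values()):
        return "error"       # duplicates disagree: LAST_WIN loses information
    return "warn"            # duplicates agree: harmless, but worth tracking

print(duplicate_key_severity("ns=0;preview=1"))            # ok
print(duplicate_key_severity("preview=1;ns=0;preview=1"))  # warn
print(duplicate_key_severity("preview=1;ns=0;preview=0"))  # error
```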

Change 988985 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery/source@master] refinery: log data quality alert severity

https://gerrit.wikimedia.org/r/988985

Change 988985 merged by Gmodena:

[analytics/refinery/source@master] refinery: log data quality alert severity

https://gerrit.wikimedia.org/r/988985