Airflow sends a failure alert for the DAG refine_webrequest_hourly_text and the task refine_webrequest, concerning a specific DAG run (1 hour of data).
The logs say:
java.lang.RuntimeException: Duplicate map key preview was found, please check the input data.
The problem happens when the job tries to read wmf_raw.webrequest and parse the x_analytics field (string) as a map with str_to_map(x_analytics, '\;', '=').
The x_analytics string values that cause the error contain the same key more than once, for example:
preview=1;ns=0;page_id=46350;include_pv=1;preview=1;https=1;client_port=1588
In this case the duplicate key was preview; we have also seen it with the key pageview.
So far, the number of affected rows in a wmf_raw.webrequest hourly partition has been between 1 and 2.
Over the last couple of months we have seen this issue several times; in the last week alone it happened 6 times.
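For reference, this minimal spark-sql query (a hypothetical literal mirroring the x_analytics value above) reproduces the failure under the default dedup policy:

  -- Default policy is spark.sql.mapKeyDedupPolicy=EXCEPTION, so duplicate keys throw.
  SELECT str_to_map('preview=1;ns=0;preview=1', '\;', '=');
  -- java.lang.RuntimeException: Duplicate map key preview was found, please check the input data.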
We can manually re-run the job by filtering out the corrupted rows with the excluded_row_ids mechanism.
But it is a manual task that takes time from us and delays all the webrequest/pageview-dependent pipelines.
Thus, we should fix this permanently.
A possible solution is to set the Spark option spark.sql.mapKeyDedupPolicy=LAST_WIN in the job code, so that duplicate keys do not break the computation.
With this policy, the last occurrence of a duplicated key-value pair becomes the effective one when parsing the x_analytics string into a map.
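As a sketch of the changed behavior (shown here with the spark-sql SET syntax and the same hypothetical literal; the job could equally set the option on its Spark session config):

  SET spark.sql.mapKeyDedupPolicy=LAST_WIN;
  -- The duplicate preview key no longer throws; its last occurrence becomes the map value.
  SELECT str_to_map('preview=1;ns=0;preview=1', '\;', '=');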
Pro: it's the simplest solution, a one-liner.
Con: it could hide issues with the x_analytics header (and with the other str_to_map operations in the query, e.g. the TLS field).
Compromise: if we choose to go this way, we could add checks for duplicate keys in an external data quality check that we can follow up on, as sketched below.
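A sketch of such a check, assuming LAST_WIN is enabled so str_to_map no longer throws: with duplicates, the parsed map ends up smaller than the number of key=value pairs, which we can count per hourly partition.

  -- Hypothetical duplicate-key check; restrict the WHERE clause to the hourly
  -- partition under validation (webrequest_source/year/month/day/hour).
  SELECT count(*) AS rows_with_duplicate_x_analytics_keys
  FROM wmf_raw.webrequest
  WHERE x_analytics IS NOT NULL
    AND x_analytics != ''
    AND size(split(x_analytics, '\;')) > size(str_to_map(x_analytics, '\;', '='));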