
2024-10-10 Data Loss Incident - webrequest Hive table
Closed, ResolvedPublic

Description

  • Since 2024-09-04 we can see drops in pageview data on turnilo: https://w.wiki/BV7M
  • These drops can also be seen on the sampled webrequest data in Druid (we don't have 90 days of retention there, so only the last 30 days are shown): https://w.wiki/BV7R
  • It affects both the text and upload webrequest sources, but upload a lot less: https://w.wiki/BV8R
  • The problem comes from neither the source nor the stream transportation layer (Kafka), as the streaming data show no drop (streaming sampled webrequest in Druid): https://w.wiki/BV7X
  • Airflow jobs show no failures, and Spark jobs show no failures.
  • For a few weeks we have had regular alerts about the HDFS RPC queue being overwhelmed: https://grafana.wikimedia.org/goto/PcT_YCzNg?orgId=1

Incident report drafting: https://docs.google.com/document/d/1sJ8f1FHB-gLom5Po0vLjoTdIY5KlaJUwPdClrBY7DtE/edit?tab=t.0#heading=h.2ro4l2vgyh3x
(Will eventually be moved to wikitech)

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: Mitigate refine Webrequest concurrency problem
Reference: repos/data-engineering/airflow-dags!868
Author: aqu
Source Branch: T376882_webrequest_concurrency_on_write_failure
Dest Branch: main

Event Timeline

JAllemandou triaged this task as Unbreak Now! priority.Oct 10 2024, 9:54 AM

Change #1079231 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update webrequest raw retention period on HDFS

https://gerrit.wikimedia.org/r/1079231

Change #1079231 merged by Btullis:

[operations/puppet@production] Update webrequest raw retention period on HDFS

https://gerrit.wikimedia.org/r/1079231

The problem is visible on HDFS:

hdfs dfs -du -s -h /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
47.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=0
48.4 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=1
61.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=10
68.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=11
72.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=12
75.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=13
75.0 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=14
74.3 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=15
72.1 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=16
34.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=17     <------
73.2 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=18
73.2 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=19
48.5 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=2
68.8 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=20
58.9 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=21
50.4 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=22
11.2 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=23     <------
48.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=3
48.1 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=4
48.9 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=5
29.0 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=6      <-------
56.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=7
57.9 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=8
59.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=9

My assumption is that the spark job has issues writing some files (we expect the job to write 256 data files plus one _SUCCESS file):

for h in $(seq 0 23); do
hdfs dfs -count /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=${h}
done

  folders      files       size
     1          257        51127043015 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=0
     1          257        51932086254 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=1
     1          257        52034244313 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=2
     1          257        52195982274 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=3
     1          255        51675216019 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=4
     1          257        52466637270 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=5
     1          139        31149593072 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=6  <----
     1          257        60807492403 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=7
     1          257        62123501867 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=8
     1          257        64106326561 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=9
     1          257        66218439434 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=10
     1          257        73770859437 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=11
     1          257        78004081117 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=12
     1          257        81243426484 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=13
     1          257        80536355483 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=14
     1          257        79798430123 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=15
     1          257        77380651833 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=16
     1          124        37107759510 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=17    <-----
     1          253        78644308180 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=18    <----- hidden using size only!
     1          257        78552471676 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=19
     1          257        73871182631 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=20
     1          257        63270015345 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=21
     1          257        54065064226 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=22
     1           63        12055321136 /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=23     <------

No such pattern on raw data:

hdfs dfs -du -s -h /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/*

24.4 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=00
24.8 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=01
24.8 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=02
25.0 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=03
24.9 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=04
25.1 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=05
28.0 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=06
29.8 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=07
30.7 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=08
31.7 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=09
32.5 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=10
35.1 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=11
37.3 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=12
39.1 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=13
39.0 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=14
38.3 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=15
37.3 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=16
37.3 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=17
38.6 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=18
37.9 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=19
35.5 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=20
30.8 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=21
26.2 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=22
24.1 G  /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=23

for h in $(seq -w 00 23); do
hdfs dfs -count /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=${h}
done

  folders      files       size
     1          169        26180632197 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=00
     1          169        26636188711 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=01
     1          169        26678737634 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=02
     1          169        26856815446 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=03
     1          169        26750771368 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=04
     1          169        26950670567 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=05
     1          169        30082645048 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=06
     1          169        31962933331 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=07
     1          169        32946864821 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=08
     1          169        34004588520 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=09
     1          169        34912584945 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=10
     1          169        37637318452 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=11
     1          169        40041545452 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=12
     1          169        42035695332 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=13
     1          169        41867782904 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=14
     1          169        41136512109 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=15
     1          169        40089390924 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=16
     1          169        40021561174 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=17
     1          169        41474222805 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=18
     1          169        40663989712 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=19
     1          169        38121802617 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=20
     1          169        33100043001 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=21
     1          169        28137034444 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=22
     1          169        25871322707 /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/hour=23

--> The problem comes from the refine-webrequest job

Wha? And the _SUCCESS flag is written even if the job is failing? Or... is it somehow not failing?

From Joseph in slack:

Latest state of affairs:

We have agreed with @aqu to use mapreduce.fileoutputcommitter.algorithm.version=2 -- it's not perfect, but it should at least prevent most of the failures. See here for the difference between v1 and v2: http://www.openkb.info/2019/04/what-is-difference-between.html

The problem has not occurred only since the HDFS RPC queue storms that started on 2024-09-04. It was also present before, but was less prevalent (15% of webrequest_text hours have missing files, representing 0.3% of the data -- not that bad).

Yes, we suspect this issue is related to concurrency access on the _temporary folder that Spark uses when running with algorithm.version=1.

The problem arises because the upload and text webrequest refine processes use the same _temporary directory. One process may clear the _temporary folder while the other is still writing to it, potentially causing the missing files.

With algorithm.version=2, Spark tasks write directly to the final output location, bypassing the need to stage files in the _temporary directory.

In our situation, switching to algorithm.version=2 might resolve the concurrency issue by eliminating the shared use of the _temporary directory between the two processes.
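For illustration, the committer switch can be applied at submission time with a single Spark configuration flag. This is only a sketch: the jar, class, and arguments below are placeholders, not the actual refine-webrequest invocation.

# Minimal sketch: enable the v2 file output committer for a refine run.
# The jar, class, and arguments are hypothetical; only the --conf flag matters here.
spark-submit \
  --master yarn \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --class org.example.RefineWebrequest \
  refine-webrequest.jar \
  --webrequest_source text --year 2024 --month 10 --day 9 --hour 17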

I'm going to try to summarize the suspected problem from today's Data Eng sync meeting.

The issue that causes this is described well in T347076#9334900.

Fixing this globally was considered in T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues, but it was decided instead to make Airflow default to disallowing parallel dag runs by setting max_active_dag_runs=1. That decision was later reverted back to 3.

Instead, folks are advised to set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2 if they encounter this error when backfilling, as documented here.

We parallelize webrequest refinement by generating two dags for each hour, one for each webrequest_source, which are currently only 'upload' and 'text'. This was done because many downstream jobs only care about the webrequest_source=text Hive partition. Refining these separately allows users to depend on one or the other without having to wait for both.

Note that the use of max_active_dag_runs=1 would not have avoided this current issue with webrequest loss. These are separate dags, so their tasks are allowed to run in parallel even if max_active_dag_runs=1 (it is currently 3 anyway).

The use of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2 is basically necessary if parallel writes are to be used on non-Iceberg Hive tables. We were not using it for webrequest.


So, it is likely that we have been experiencing small webrequest loss for a long period of time, at least since we migrated webrequest to airflow, but possibly before. We only noticed this now because of a recent Dumps 2 Iceberg optimization, described in T374656#10218643. This optimization reduced the time for some Dumps 2 write tasks, but hugely increased the usage of the Hadoop RPC APIs. This increase resulted in Hadoop RPC failures in other, unrelated jobs, like webrequest refinement.

@JAllemandou, If this is truly the problem, why don't we see an error like

java.io.FileNotFoundException: File hdfs://analytics-hadoop/user/analytics-product/data/wikipedia_chatgpt_plugin_searches/daily/_temporary/0 does not exist.

as was seen in T347076: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist.?

I can see how one Hadoop AM with spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 1 might clobber another AM's _temporary directory, but why wouldn't that AM then fail?

Discussed ^ with @Antoine_Quhen in slack.

We think the failure path is different here than in T347076. But, I'm still not sure.

If job A finishes and deletes _temporary while job B is running, yes job B's previously finished task output will be lost. But, how is job B proceeding to write from new tasks and not failing? As far as I can tell, _temporary is created when setupJob is called, not during task initialization.

Ah, Antoine found this comment:

// FileOutputCommitter's setupTask doesn't do anything. Because the
// temporary task directory is created on demand when the 
// task is writing.

Welp, that would do it! I guess we'll have to take their word for it!
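To make the suspected sequence concrete, here is a sketch of the v1 staging layout and the race. The paths and ordering are illustrative assumptions, not the exact layout used by the refine jobs:

# FileOutputCommitter v1 stages committed task output under the shared output path:
#   <output>/_temporary/0/task_..._0007/part-00007-...
# Suspected sequence (illustrative):
#   1. The 'upload' and 'text' refine jobs both stage files under the same <output>/_temporary.
#   2. Job A finishes first; its commitJob/cleanup deletes <output>/_temporary.
#   3. Job B's already-committed task files staged there are silently removed.
#   4. Job B's remaining tasks recreate _temporary on demand (setupTask is a no-op),
#      so the job keeps running, commits, and still writes _SUCCESS -- with fewer data files.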

This morning I started to backfill hours with missing data using this document.
Since the jobs were running in parallel, they experienced the "temporary folder deletion" issue. I had 2 jobs running in parallel, and the one that finished later had failed tasks.
However, both result folders have 257 files. This means our solution of setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 for the jobs works.

There can still be synchronization cases where issues arise, but with extremely small probability. I would nonetheless recommend that we add a quality check at the end of the webrequest jobs to validate that the resulting folder indeed contains 257 files.
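A check along these lines could do it. This is only a sketch, with the day and webrequest_source hard-coded for illustration (257 = 256 data files + _SUCCESS):

# Sketch: flag refined hourly partitions that do not contain the expected 257 entries.
EXPECTED=257
BASE=/wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9
for h in $(seq 0 23); do
  n=$(hdfs dfs -count "${BASE}/hour=${h}" | awk '{print $2}')
  if [ "${n}" -ne "${EXPECTED}" ]; then
    echo "INCOMPLETE: hour=${h} has ${n} files (expected ${EXPECTED})"
  fi
done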

...
So, it is likely that we have been experiencing small webrequest loss for a long period of time, at least since we migrated webrequest to airflow, but possibly before. We only noticed this now because of a recent Dumps 2 Iceberg optimization, described in T374656#10218643. This optimization reduced the time for some Dumps 2 write tasks, but hugely increased the usage of the Hadoop RPC APIs. This increase resulted in Hadoop RPC failures in other, unrelated jobs, like webrequest refinement.

I don't think these two issues are related, as the RPC issue described in T374656#10218643 started on Sept 24 2024, and the issue from this ticket started way before on Sept 9th as per T376882#10217047.

I just commented on https://issues.apache.org/jira/browse/MAPREDUCE-7331 suggesting that they just fix this bug once and for all by making the PENDING_WORK_DIR unique per job.

the issue from this ticket started way before on Sept 9th as per T376882#10217047.

Yes, but we suspect the RPC issue somehow exacerbated the issue. The loss is worse for some hours on Sept 24, 25 and 26 than any other days. There is an instance of a lot of loss in an hour on Sept 4 though, which doesn't line up. So, ya, it could just be a coincidence.

Ottomata renamed this task from Some hours of webrequest are not refined entirely to 2024-10-10 Data Loss Incident - webrequest Hive table .Oct 11 2024, 3:09 PM

@JAllemandou I started backfilling webrequest_actor_metrics_hourly but got a little confused as to what airflow was doing with the backfill command. I think I need a little airflow admin assistance.

@xcollazo helped me. I was using the Airflow CLI, and thought the right command was airflow dags backfill. That was incorrect. Also, launching a backfill for a dagrun that has already been scheduled results in weird behavior. We had to delete the dagruns I mistakenly backfilled, and then backfill them again, to get them into the correct state.

The others I was able to just clear and they reran fine.
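For reference, the flow that worked looks roughly like this with the Airflow 2.x CLI; the dag id and dates are illustrative assumptions:

# Clear already-scheduled dagruns so the scheduler reruns them in place:
airflow tasks clear webrequest_actor_metrics_hourly \
  --start-date 2024-10-09T06:00:00 --end-date 2024-10-09T08:00:00 --yes
# 'airflow dags backfill' is meant for intervals that were never scheduled;
# running it against existing dagruns produced the confusing state described above.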

OpsWeek note:

We are getting the following failures:

FAIL: refinery-drop-webrequest-raw-partitions

Alert thread: https://groups.google.com/a/wikimedia.org/g/data-engineering-alerts/c/2XQVFvoDgb0

This is failing due to https://github.com/wikimedia/operations-puppet/commit/5283bcac15ec01144dc33291d99fd6beb5691870, which changed the retention period, but not the hash we pass in the --execute parameter.

I understand that this is temporary until we fix this ticket. But just noting it here so that we do not forget.

In case this will take longer to fix, let me know, and I can put together a patch to fix the --execute parameter.

I generated the new (temporary) checksum via:

sudo -u hdfs bash
cd /srv/deployment/analytics/refinery
export PYTHONPATH="/srv/deployment/analytics/refinery/python"
kerberos-run-command hdfs bin/refinery-drop-older-than --database='wmf_raw' --tables='webrequest' --base-path='/wmf/data/raw/webrequest' --path-format='.+/year=(?P<year>[0-9]+)(/month=(?P<month>[0-9]+)(/day=(?P<day>[0-9]+)(/hour=(?P<hour>[0-9]+))?)?)?' --older-than='60' --allowed-interval='3' --skip-trash

Ran 35 tests in 0.113s

OK
2024-10-11T21:08:09 INFO   Unit tests passed.
2024-10-11T21:08:09 INFO   Starting DRY-RUN.
2024-10-11T21:08:36 INFO   No Hive partitions dropped for table wmf_raw.webrequest.
2024-10-11T21:08:36 INFO   No directories removed.
2024-10-11T21:08:36 INFO   DRY-RUN finished.
Security checksum (use --help for more information): 04fce0c9fbb6bc837e4562d9328de41b

Change #1079573 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] Fix security checksum for web_request's refinery-drop-older-than

https://gerrit.wikimedia.org/r/1079573

Change #1079573 merged by Btullis:

[operations/puppet@production] Fix security checksum for web_request's refinery-drop-older-than

https://gerrit.wikimedia.org/r/1079573

Thanks a lot for fixing the data-deletion checksum @xcollazo !

BTullis subscribed.

I think that Data-Engineering has this all in-hand now, is that right? Until the time comes to revert the change to the webrequest retention time.

I created this patch:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/879
... to enable the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 by default in the WMF Airflow Spark Operator

..as per: T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues

@JAllemandou / @Ahoelzl: Should this task remain open and at "Unbreak now" priority ("Something is broken and needs to be fixed immediately, setting anything else aside")? Asking as there has been no activity for a month. Thanks!

This could be resolved. However, our process is that tasks in our quarterly Done column are not resolved until the end of the sprint.

Should we lower the priority?