
Webrequest text fails to refine regularly
Closed, Resolved · Public

Description

The latest hours that failed:

https://hue.wikimedia.org/oozie/list_oozie_workflow/0007757-191216160148723-oozie-oozi-W/?coordinator_job_id=0000001-191216144226458-oozie-oozi-C&bundle_job_id=0000000-191216144226458-oozie-oozi-B
https://hue.wikimedia.org/oozie/list_oozie_workflow/0007797-191216160148723-oozie-oozi-W/?coordinator_job_id=0000001-191216144226458-oozie-oozi-C&bundle_job_id=0000000-191216144226458-oozie-oozi-B

Taking hour 17 above as an example, here is what we observe:

Event Timeline

Looking at volume on Varnish and Kafka, nothing stands out.

Webrequest_text volumes look OK, and ditto for upload:

60.5 G  181.5 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/11
64.2 G  192.7 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/12
68.3 G  205.0 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/13
70.7 G  212.2 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/14
71.9 G  215.7 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/15
72.2 G  216.7 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/16
72.3 G  216.9 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/17
71.0 G  213.0 G  hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2019/12/14/18

See prior similar problem: T197281: Fix failing webrequest hours (upload and text 2018-06-14-11)

  • The number of records at that hour is about the same as at the same hour the day before (see the query sketch below)

121414760 / 124226799.0 = 0.9773636685269497
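
For reference, a rough sketch of the kind of count comparison behind that ratio, assuming the raw data is queryable as a wmf_raw.webrequest table partitioned like wmf.webrequest (table name and partition values here are assumptions):

-- Hedged sketch: compare raw record counts for the failing hour against
-- the same hour the day before.
SELECT day, COUNT(*) AS records
FROM wmf_raw.webrequest
WHERE webrequest_source = 'text'
  AND year = 2019 AND month = 12
  AND day IN (13, 14)
  AND hour = 17
GROUP BY day;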

Also looked at the logs of one of the tasks that was killed because it exceeded memory:

sudo -u analytics mapred job -logs job_1573208467349_185271 attempt_1573208467349_185271_m_000000_1 > attempt_1573208467349_185271_m_000000_1.txt

Besides some warnings about IDs not found in MaxMind, there is nothing, so maybe this is just an issue with container memory being on the edge?

  • I think there might be an error log at /var/lib/hadoop/data/c/yarn/logs/application_1573208467349_185271/container_e02_1573208467349_185271_01_000368/ on analytics1070 that could be of use

With a special version of refinery (uploaded to HDFS in /user/elukey and /user/joal) we were able to refine the two hours in separate coordinators. The issue is still not clear, but we wanted to unblock the jobs and the Kerberos upgrade.

Checked whether the entropy UDAF indicates any unexpected changes in the webrequest fields that are used in webrequest.load (a sketch of the underlying computation follows the tables below).

=== USER AGENT ===
hour	entropy
11	10.54753394241449
12	10.56885036456681
13	10.494309570202567
14	10.475783511181957
15	10.41000334885259

=== X-ANALYTICS ===
hour	entropy
11	4.76234273934101
12	4.679339853346528
13	4.633225441013469
14	4.600864030491075
15	4.592986151529772

=== IP ===
hour	entropy
11	21.225638969339453
12	21.413682830148964
13	21.38601943730442
14	21.686812764868563
15	21.723495858041463

=== URI_HOST ===
hour	entropy
11	4.574810372073657
12	4.5582752093968155
13	4.460794216497132
14	4.449782274496132
15	4.409413677750083

=== URI_PATH ===
hour	entropy
11	12.915420900915635
12	12.812357193481974
13	13.153114455626495 <---------(?)
14	12.848538635940937
15	12.921515275279035

=== URI_QUERY ===
hour	entropy
11	7.0312416154907575
12	7.000230030519391
13	6.853914731584187 <----------(?)
14	6.945057342791713
15	6.929870232583995
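
For context, here is a rough sketch of the per-field entropy computation behind the tables above, written as plain Spark SQL instead of the refinery entropy UDAF that was actually used (the field, day, and hour below are picked purely for illustration):

-- Hedged sketch: Shannon entropy of user_agent for one hour,
-- H = -sum(p * log2(p)) over the distribution of field values.
SELECT -SUM((cnt / total) * LOG2(cnt / total)) AS entropy
FROM (
  SELECT user_agent,
         CAST(COUNT(*) AS DOUBLE) AS cnt,
         SUM(COUNT(*)) OVER ()    AS total
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2019 AND month = 12 AND day = 14 AND hour = 11
  GROUP BY user_agent
) per_value;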

Nice. I also looked at URLs by hand a bit and could not find anything of significance. I think we need to keep this in the "Paused" column for now.

mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to Paused on the Analytics-Kanban board.
JAllemandou moved this task from Incoming to Ops Week on the Analytics board.
JAllemandou renamed this task from "two hours of webrequest not refining" to "Webrequest text fails to refine regularly". Dec 23 2019, 9:43 AM
JAllemandou claimed this task.
JAllemandou updated the task description.

Something to notice: mappers regularly fail during webrequest-text refine, but not in a way that makes the job fail. Reruns succeed, probably because of synchronization between GC and YARN checks. Something else interesting to notice is that the number of failed mappers decreases with the size of the data to process.

  • webrequest-load-wf-text-2019-12-22-15 - 21 failed mappers
  • webrequest-load-wf-text-2019-12-22-16 - 14 failed mappers
  • webrequest-load-wf-text-2019-12-22-18 - 12 failed mappers
  • webrequest-load-wf-text-2019-12-22-19 - 4 failed mappers
  • webrequest-load-wf-text-2019-12-22-20 - 3 failed mappers
  • webrequest-load-wf-text-2019-12-22-21 - 2 failed mappers
  • webrequest-load-wf-text-2019-12-22-22 - 0 failed mappers

The mapper-memory problem comes from the user-agent parsing UDF. We changed it not long ago (https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/537317), growing its cache and therefore using more memory.
Check:

  1. Run webrequest-refine manually on webrequest_source='text' AND year=2019 AND month=12 AND day=22 AND hour=17 --> failure at map stage, as expected
  2. Run the same as above with the user_agent UDF usage removed --> success (no failed mapper)
  3. Run the regular version with SET mapreduce.map.memory.mb=3072; SET mapreduce.map.java.opts=-Xmx2458m; --> success (no failed mapper; see the sketch after this list)
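
For clarity, a minimal sketch of what check 3 looks like in practice, with the overrides placed at the top of the Hive session (or of refine_webrequest.hql) ahead of the regular refine statement; the exact placement in the file is an assumption:

-- Hedged sketch of check 3: mapper memory overrides ahead of the refine query.
SET mapreduce.map.memory.mb=3072;       -- YARN container size for mappers
SET mapreduce.map.java.opts=-Xmx2458m;  -- JVM heap, roughly 0.8 of the container
-- ... followed by the unmodified refine INSERT/SELECT from refine_webrequest.hql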

Side note: the error occurs on weekends because of usage patterns. On weekends mobile hits are a lot higher than desktop ones, while it's the opposite on weekdays (see Turnilo pageviews). This leads to a larger number of distinct user agents on weekends (I used month=12, day=10, hour=17 for comparison as it's the biggest hour in terms of data size available today).

spark.sql("select count(distinct user_agent) from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 12 and day = 22 and hour = 17").show(100, false)

+--------------------------+                                                    
|count(DISTINCT user_agent)|
+--------------------------+
|584474                    |
+--------------------------+


spark.sql("select count(distinct user_agent) from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 12 and day = 10 and hour = 17").show(100, false)

+--------------------------+                                                    
|count(DISTINCT user_agent)|
+--------------------------+
|447644                    |
+--------------------------+

Suggested fix: Bump mappers' memory in webrequest-refine to 3072m (patch to follow).


The memory issue observed for reducers comes from me providing wrong number pairs for mapreduce.*.java.opts in relation to mapreduce.*.memory.mb here: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/557836/3/oozie/webrequest/load/refine_webrequest.hql.
mapreduce.*.memory.mb is used by YARN to check container resource consumption, while mapreduce.*.java.opts is used by the JVM to manage memory (notably GC). The latter needs to be smaller than the former (a 0.8 ratio is usually advised) so that GC keeps the memory footprint below the limit YARN enforces. In my patch I provided mapreduce.*.java.opts BIGGER than mapreduce.*.memory.mb.
Check:

  • After the above success, run a webrequest version with SET mapreduce.reduce.memory.mb=4096; SET mapreduce.reduce.java.opts=-Xmx4915m; --> fails with reducer failures.
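
For illustration, a hedged sketch contrasting the broken reducer pairing above with a ratio-consistent one (the corrected values below are hypothetical, not taken from the patch):

-- Broken pairing from the check above: JVM heap larger than the YARN container,
-- so YARN kills the container before GC can keep the footprint under the limit.
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts=-Xmx4915m;

-- Ratio-consistent pairing (hypothetical values, heap ~0.8 of the container):
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts=-Xmx3276m;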

I'm sorry to have generated more problems than I solved with my quick patch the other day :(

Change 557836 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Override default map memory in webrequest-refine

https://gerrit.wikimedia.org/r/557836

Change 557836 merged by Elukey:
[analytics/refinery@master] Override default map memory in webrequest-refine

https://gerrit.wikimedia.org/r/557836