Page MenuHomePhabricator

Eventlogging's processors stopped working
Closed, DeclinedPublic

Description

On Saturday Jul 28th at around 15:48 UTC all the eventlogging processors stopped working all of a sudden. From the logs it seems that this happens (for each processor):

  1. the processor's consumer (via kafka-python) gets stuck reading one or more events
  2. the processor's producer sockets are torn down by the Kafka brokers after some minutes (corresponding to the default idle timeout before cleaning up "unused" sockets).

The next step is try to figure out what causes this mess and let Eventlogging restart.

Event Timeline

elukey triaged this task as Unbreak Now! priority.Jul 29 2018, 12:25 AM
elukey created this task.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes, Aklapper. · View Herald TranscriptJul 29 2018, 12:25 AM

Current state is:

  • all processors down except the 11th (to work only on one)
  • logging debug enabled for eventlogging, kafka-python and confluent-kafka-python (manually hacking eventlogging's py files on disk).

It seems that each time I restart the processor, some consumer/produce activity happens, but then everything stops. One curious thing is that if I isolate a debug log like the following, I can keep seeing it after every run:

VirtualPageView is blacklisted, not writing event <Event 36ed67cd3a5f51f5a06abebe0db66efe of schema (u'VirtualPageView', 17780078)>

After a bit of a restart, the only debug logs that keeps going are like the following (for various topics):

Jul 29 08:14:25 eventlog1002 eventlogging-processor@client-side-11[31796]: %7|1532852065.403|TOPPAR|eventlogging-bda8e544-9306-11e8-8347-1418775b0d42-eventlog1002.eqiad.wmnet.31796#producer-1| kafka-jumbo1002.eqiad.wmnet:9092/1002: eventlogging_InputDeviceDynamics [0] 0+0 msgs

Nothing relevant on the kafka broker logs afaics.

elukey added a comment.EditedJul 29 2018, 8:24 AM

Spot checking with kafkacat on various topics (eventlogging_$something) I can see that the last timestamp registered is "2018-07-28T15:44:54Z" (exactly when the mess happened). More interestingly, on eventlogging-valid-mixed I can see two timestamps that pops up as last one in the various partitions:

  • 2018-07-28T15:44:54Z
  • 2018-07-28T17:09:45Z

It is also interesting that 2018-07-28T15:44:54Z is the last timestamp for the all-events consumer as well (that polls from eventlogging-valid-mixed), meanwhile the client-side-events one is constantly working as nothing happened.

So now I am wondering: if the client-side-events consumer, that should use kafka-python to consume, is not stuck but the processors are, does this mean that the issue is on the confluent-kafka-python producers side?

The python process of processor 11th (the only one currently running as written above) is listed by top as:

 PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
2032 eventlo+  20   0 1426140  89588  10608 R 99.9  0.1   9:34.68 python

strace -p $pid -f shows constantly:

[pid  2040] futex(0x556186dfab7c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 13359, {tv_sec=1532853092, tv_nsec=394968000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2040] futex(0x556186dfab50, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  2040] poll([{fd=4, events=POLLIN}], 1, 100 <unfinished ...>
[pid  2046] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)

And for gdb, I can see this result for thread apply all bt:

Thread 18 (Thread 0x7f406e435700 (LWP 2053)):
#0  0x00007f40ac9af536 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f40ac9af5e4 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00005561835ead34 in PyThread_acquire_lock ()
#3  0x0000556183636caa in PyEval_RestoreThread ()
#4  0x00005561836e2a49 in ?? ()
#5  0x000055618360fb9a in PyEval_EvalFrameEx ()
#6  0x000055618360dd45 in PyEval_EvalCodeEx ()
#7  0x0000556183615738 in PyEval_EvalFrameEx ()
#8  0x000055618360dd45 in PyEval_EvalCodeEx ()
#9  0x0000556183615f15 in PyEval_EvalFrameEx ()
#10 0x000055618360dd45 in PyEval_EvalCodeEx ()
#11 0x0000556183615f15 in PyEval_EvalFrameEx ()
#12 0x000055618361542f in PyEval_EvalFrameEx ()
#13 0x000055618360dd45 in PyEval_EvalCodeEx ()
#14 0x000055618362aa7e in ?? ()
#15 0x00005561835fc413 in PyObject_Call ()
#16 0x000055618364135e in ?? ()
#17 0x00005561835fc413 in PyObject_Call ()
#18 0x0000556183619ec0 in PyEval_CallObjectWithKeywords ()
#19 0x00005561836e25d2 in ?? ()
#20 0x00007f40ac9a7494 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#21 0x00007f40abdc4acf in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 17 (Thread 0x7f406effd700 (LWP 2052)):
#0  0x00007f40abdbb67d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f40a86f2a67 in ?? () from /usr/lib/x86_64-linux-gnu/librdkafka.so.1
#2  0x00007f40a86f2aab in ?? () from /usr/lib/x86_64-linux-gnu/librdkafka.so.1
#3  0x00007f40a86e3830 in ?? () from /usr/lib/x86_64-linux-gnu/librdkafka.so.1
#4  0x00007f40a86e50bc in ?? () from /usr/lib/x86_64-linux-gnu/librdkafka.so.1
#5  0x00007f40a8722bf7 in ?? () from /usr/lib/x86_64-linux-gnu/librdkafka.so.1
#6  0x00007f40ac9a7494 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f40abdc4acf in clone () from /lib/x86_64-linux-gnu/libc.so.6

[..the last repeated over and over..]

I installed python-dbg and then I was able to use the awesome thread apply all py-bt function in gdb, that returned:

(gdb) thread apply all py-bt

Thread 18 (Thread 0x7f406e435700 (LWP 2053)):
Traceback (most recent call first):
  Waiting for the GIL
  File "/usr/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 965, in _run_once
    self.coordinator._lock.wait(self.coordinator.config['retry_backoff_ms'] / 1000)
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 910, in run
    self._run_once()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()

Thread 17 (Thread 0x7f406effd700 (LWP 2052)):
Traceback (most recent call first):

Thread 16 (Thread 0x7f406f7fe700 (LWP 2051)):
Traceback (most recent call first):

Thread 15 (Thread 0x7f4067fff700 (LWP 2050)):
Traceback (most recent call first):

Thread 14 (Thread 0x7f406ffff700 (LWP 2049)):
Traceback (most recent call first):

Thread 13 (Thread 0x7f409cdf3700 (LWP 2048)):
Traceback (most recent call first):

Thread 12 (Thread 0x7f409d5f4700 (LWP 2047)):
Traceback (most recent call first):

Thread 11 (Thread 0x7f409ddf5700 (LWP 2046)):
Traceback (most recent call first):

Thread 10 (Thread 0x7f409e5f6700 (LWP 2045)):
Traceback (most recent call first):

Thread 9 (Thread 0x7f409effd700 (LWP 2044)):
Traceback (most recent call first):

Thread 8 (Thread 0x7f409f7fe700 (LWP 2043)):
Traceback (most recent call first):

---Type <return> to continue, or q <return> to quit---
Thread 7 (Thread 0x7f409ffff700 (LWP 2042)):
Traceback (most recent call first):

Thread 6 (Thread 0x7f40a4836700 (LWP 2041)):
Traceback (most recent call first):

Thread 5 (Thread 0x7f40a5037700 (LWP 2040)):
Traceback (most recent call first):

Thread 4 (Thread 0x7f40a5838700 (LWP 2039)):
Traceback (most recent call first):

Thread 3 (Thread 0x7f40a6039700 (LWP 2038)):
Traceback (most recent call first):

Thread 2 (Thread 0x7f40a683a700 (LWP 2037)):
Traceback (most recent call first):

Thread 1 (Thread 0x7f40acdd2700 (LWP 2032)):
Traceback (most recent call first):
  File "/usr/lib/python2.7/dist-packages/ua_parser/user_agent_parser.py", line 51, in Parse
    match = self.user_agent_re.search(user_agent_string)
  File "/usr/lib/python2.7/dist-packages/ua_parser/user_agent_parser.py", line 252, in ParseUserAgent
    family, v1, v2, v3 = uaParser.Parse(user_agent_string)
  File "/usr/lib/python2.7/dist-packages/ua_parser/user_agent_parser.py", line 225, in Parse
    'user_agent': ParseUserAgent(user_agent_string, **jsParseBits),
  File "/srv/deployment/eventlogging/analytics/eventlogging/utils.py", line 405, in parse_ua
    parsed_ua = user_agent_parser.Parse(user_agent)
  File "/srv/deployment/eventlogging/analytics/eventlogging/parse.py", line 236, in <dictcomp>
    if k != 'capsule' and k not in capsule.keys()
  File "/srv/deployment/eventlogging/analytics/eventlogging/parse.py", line 232, in parse
    k: f(matches[k]) for k, f in caster_dict.items()
  File "/srv/deployment/eventlogging/analytics/bin/eventlogging-processor", line 129, in <module>
    event = parser.parse(raw_event)

And a even more awesome py-locals:

Thread 1 (Thread 0x7f40acdd2700 (LWP 2032)):
  46                               for group_index in range(1, match.lastindex + 1)]
  47            return match_spans
  48
  49        def Parse(self, user_agent_string):
  50            family, v1, v2, v3 = None, None, None, None
 >51            match = self.user_agent_re.search(user_agent_string)
  52            if match:
  53                if self.family_replacement:
  54                    if re.search(r'\$1', self.family_replacement):
  55                        family = re.sub(r'\$1', match.group(1), self.family_replacement)
  56                    else:
(gdb) py-locals
self = <UserAgentParser(v1_replacement=None, pattern=u'^(.*)/(\\d+)\\.?(\\d+)?.?(\\d+)?.?(\\d+)? CFNetwork', v2_replacement=None, user_agent_re=<_sre.SRE_Pattern at remote 0x556186bd18b0>, family_replacement=None) at remote 0x7f40a9de62d0>
user_agent_string = u'"Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111.111111111111111111111111111111111111111111111111111111111111111111111111 (KHTML, like Gecko) Linux/222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222'
family = None
v1 = None
v2 = None
v3 = None

So this is the first time that I use these awesome gdb extension, but it seems that a weird UA blocks EL?

Tested the following, seems not completing:

>>> from ua_parser import user_agent_parser
>>> ua_string = 'Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111.111111111111111111111111111111111111111111111111111111111111111111111111 (KHTML, like Gecko) Linux/222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222'
>>> parsed_string = user_agent_parser.Parse(ua_string)

Checking top my python process gets up to 99% and shows the same behavior, so I think this is the issue that blocks the eventlogging processors.

elukey added a subscriber: Volans.Jul 29 2018, 10:36 AM

A even simpler repro case that me and @Volans found is the following:

import re

ua_string = 'Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111.111111111111111111111111111111111111111111111111111111111111111111111111 (KHTML, like Gecko) Linux/222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222'
a = re.compile('^(.*)/(\\d+)\\.?(\\d+)?.?(\\d+)?.?(\\d+)? CFNetwork')
a.search(ua_string)

So, after digging a bit, the culprit is the regex that @elukey has pasted above: ^(.*)/(\\d+)\\.?(\\d+)?.?(\\d+)?.?(\\d+)? CFNetwork, that comes from https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L12
Note also how the dots after the first one are not escaped, so matching any character, that's probably a bug and not what was intended. Also the few regex above that one have the same issue.

Now while this works for example in PCRE, it goes in an infinite backtracking in Python. I've tried to expand it to avoid the backtracking into something like:

'^(.*)/(\\d+)(?:|\\.|\\.(\\d+)|\\.(\\d+)\\.|\\.(\\d+)\\.(\\d+)|\\.(\\d+)\\.(\\d+)\\.|\\.(\\d+)\\.(\\d+)\\.(\\d+)) CFNetwork'

Basically expanding all the possible options of versions: /1, /1., /1.2, /1.2., /1.2.3 /1.2.3., /1.2.3.4
While the regex works fine, the problem is that the matching groups are different from what the software expects, hence it fails to properly recognize the parameters. Named regex would be much better here, but the ua-core project is basically language-agnostic, not allowing to take advantage of this.

Hence my suggestion is to split them into 4 regexes:

^(.*)/(\d+) CFNetwork
^(.*)/(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)\.?(\d+)? CFNetwork

To do this we can either modify the ua-core before building the package or injecting them in setup.py when the file get's generated replacing the faulty regex.
For testing the equivalence of the two approaches I've used this:

from ua_parser import user_agent_parser

ua_strings_matching = [
    'foo/1 CFNetwork',
    'foo/1. CFNetwork',
    'foo/1.2 CFNetwork',
    'foo/1.2. CFNetwork',
    'foo/1.2.3 CFNetwork',
    'foo/1.2.3. CFNetwork',
    'foo/1.2.3.4 CFNetwork'
]
for ua in ua_strings_matching:
    print(user_agent_parser.Parse(ua))

And compared the results with the original code and with my patch applied.

Volans added a comment.EditedJul 29 2018, 11:37 AM

On a side note, when parsing the very long UA example from @elukey, there is one regex that is pretty slow, and it's https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L98

((?:[A-z0-9]+|[A-z\-]+ ?)?(?: the )?(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[A-Za-z0-9-]*(?:[^C][^Uu])[Bb]ot|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]*)(?:(?:[ /]| v)(\d+)(?:\.(\d+)(?:\.(\d+))?)?)?

On my local system it took 1.21 seconds over a total parsing time of 1.28 seconds!

Nuria added a subscriber: Nuria.Jul 29 2018, 2:23 PM

In order to unnlock we could modify parsing code to do like we do in pageview processing, if ua is over a certain size we just skip it completely https://github.com/wikimedia/analytics-refinery-source/commit/8ce2f097737ac4839656a0b8cd738f9c63d73a13

In order to unnlock we could modify parsing code to do like we do in pageview processing, if ua is over a certain size we just skip it completely https://github.com/wikimedia/analytics-refinery-source/commit/8ce2f097737ac4839656a0b8cd738f9c63d73a13

It might be an option but I think it would restrict too much the UA that we parse. The current UA that breaks EL is ~660 chars, and in order to make it work we'd need to truncate it by 2/3 at least afaics. The main issue seems to be all the numbers creating backtracking when the regex is applied :(

Nuria added a comment.EditedJul 29 2018, 3:08 PM

Length of UA at fault is 1000 chars and regular UAs are about 200 chars, how about not running through UA parser anything bigger than 4 times the regular limit that is 800 chars? (if it prevents the issue with the parsing chocking that is, cc @Volans)

Length of UA at fault is 1000 chars and regular UAs are about 200 chars, how about not running through UA parser anything bigger than 4 times the regular limit that is 800 chars? (if it prevents the issue with the parsing chocking that is, cc @Volans)

Yes sorry miscalculated the len, I remembered something less than 999 but probably it was a different test. 800 chars seems a lot but indeed it might help us to resolve (temporarily) this issue and keep going with the EL processing. I would feel more comfortable in changing the regex to avoid a recurrence of this mess again though (might be something to do as second step though).

Change 449121 had a related patch set uploaded (by Elukey; owner: Elukey):
[eventlogging@master] [WIP] Discard events with long user agents

https://gerrit.wikimedia.org/r/449121

Length of UA at fault is 1000 chars and regular UAs are about 200 chars, how about not running through UA parser anything bigger than 4 times the regular limit that is 800 chars? (if it prevents the issue with the parsing chocking that is, cc @Volans)

@Nuria A length limit protection is surely a nice to have for sanity, but it might not protect from regex backtracking in general, for example the UA from @elukey will still take a huge amount of time with lengths much shorter than the proposed limit.
A more general approach could be to limit the amount of time that the UA parser is allowed to spend on a single UA, bailing out after a defined limit of time and considering it a "broken" UA.

All of the above plus the revisitation of the regexes can live together and don't exclude each other.

Change 449121 merged by Mforns:
[eventlogging@master] Discard events with long user agents

https://gerrit.wikimedia.org/r/449121

Mentioned in SAL (#wikimedia-operations) [2018-07-30T08:07:08Z] <elukey@deploy1001> Started deploy [eventlogging/analytics@54d43e4]: Band aid for T200630

Mentioned in SAL (#wikimedia-operations) [2018-07-30T08:07:13Z] <elukey@deploy1001> Finished deploy [eventlogging/analytics@54d43e4]: Band aid for T200630 (duration: 00m 05s)

elukey lowered the priority of this task from Unbreak Now! to High.Jul 30 2018, 8:30 AM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.Jul 30 2018, 2:39 PM

Change 449422 had a related patch set uploaded (by Elukey; owner: Elukey):
[eventlogging@master] Set logging level to all kafka libraries to ease debugging sessions

https://gerrit.wikimedia.org/r/449422

Nuria added a comment.Aug 7 2018, 5:20 PM

@Volans: much agree with the len limit not being a fix at all, rather a band aid needed to keep service afloat. The best solution on my opinion is to have a timeout when processing the ua regex.

elukey removed elukey as the assignee of this task.Sep 6 2018, 1:29 PM
elukey moved this task from In Progress to Next Up on the Analytics-Kanban board.
Nuria reassigned this task from Milimetric to fdans.Nov 3 2018, 4:32 AM
Nuria added a subscriber: Milimetric.
Nuria removed fdans as the assignee of this task.Nov 3 2018, 4:35 AM
Nuria added a subscriber: fdans.

As reminder I just got a notification from GH about the issue that I've opened right after this issue: https://github.com/ua-parser/uap-core/issues/332

They seems to suggest to check version 0.6.0 of the parser lib :)

I took a look at the pull request, https://github.com/ua-parser/uap-core/pull/368, it looks like it fixes the bots regexes by replacing unlimited * or + repeaters with limited ones (eg {0.50}). It's probably a good idea to upgrade, but I would still want to do the fix we decided on here, as it would protect the processors in the general case.

Milimetric closed this task as Declined.Mar 4 2019, 4:51 PM

Closing because we haven't seen the issue resurface. Our temporary fix seems fine until we transition off of EventLogging, hopefully sooner than later.

Restricted Application removed a subscriber: Liuxinyu970226. · View Herald TranscriptMar 4 2019, 4:51 PM