
Troubleshoot EventLogging missing chunks of data {oryx}
Closed, ResolvedPublic

Description

We've observed that the EL database (m4-master.eqiad.wmnet) has some "holes": time spans that contain no data in any event table. For instance:

2015-04-09 17:20 -> 18:35
2015-04-11 03:35 -> 05:20
2015-04-11 14:00 -> 15:20
2015-04-11 19:00 -> 20:00
2015-04-12 14:30 -> 15:40
2015-04-13 11:35 -> 12:30
2015-04-13 16:30 -> 17:35

We know that:

  • The EL processor is correctly validating the missing events and writing the corresponding logs.
  • The EL mysql consumer is steadily writing events during the time of the holes. But we cannot tell whether those events are the missing ones, because the consumer logs record only the insertion timestamp, not the event timestamp.
  • The database does not receive any inserts for the missing time spans.
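Gaps like the ones listed above can be found programmatically. A minimal sketch, assuming the event timestamps have already been pulled from the database into a sorted list (`find_gaps` and the threshold value are hypothetical illustrations, not part of EventLogging):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive events are further
    apart than `threshold`, i.e. spans with no data."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps

# Toy data modelled on the first hole (2015-04-09 17:20 -> 18:35).
ts = [datetime(2015, 4, 9, 17, 0),
      datetime(2015, 4, 9, 17, 20),
      datetime(2015, 4, 9, 18, 35),   # 75-minute hole before this event
      datetime(2015, 4, 9, 18, 40)]
print(find_gaps(ts))
```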

Details

Related Gerrit Patches:
mediawiki/extensions/EventLogging (master): Insert events by scid
mediawiki/extensions/EventLogging (master): Change consumer logs to better report batch inserts
mediawiki/extensions/EventLogging (master): Add last event timestamp to mysql consumer log

Event Timeline

mforns created this task. Apr 14 2015, 8:55 PM
mforns claimed this task.
mforns raised the priority of this task to Needs Triage.
mforns updated the task description.
mforns added subscribers: mforns, Nuria, Milimetric, JAllemandou.
Restricted Application added a subscriber: Aklapper. Apr 14 2015, 8:55 PM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Apr 14 2015, 8:55 PM

Change 204261 had a related patch set uploaded (by Mforns):
Add last event timestamp to mysql consumer log

https://gerrit.wikimedia.org/r/204261

Change 204261 merged by Milimetric:
Add last event timestamp to mysql consumer log

https://gerrit.wikimedia.org/r/204261

We found a very likely explanation for the "data gap" issues.
The relevant code is in server/eventlogging/handlers.py:

events = collections.deque()
eventsBatch = collections.deque()  # Batch handed off to the worker thread.
worker = PeriodicThread(...)  # Child thread that executes the inserts.
worker.start()

...

batchSize = 400
while worker.is_alive():
    event = (yield)  # read the event coming from the processor
    events.append(event)
    if len(events) >= batchSize and not worker.ready.isSet():
        logger.debug('Queue is large, sending to child thread')
        last_timestamp = str(event.get('timestamp', None))
        logger.debug('Last event timestamp %s', last_timestamp)
        for i in range(0, batchSize):
            eventsBatch.append(events.popleft())
        worker.ready.set()

Let's look at the following flow:

  1. The events buffer fills until it contains 400 events.
  2. At that point, all 400 events are popped from events and dumped into eventsBatch.
  3. The worker's ready flag is set, so the worker will execute the inserts.
             400
Parent |------|-----------------------------------
Worker |------|-----------------------------------
            start
  4. While the worker is working, the events buffer reaches 400 events again.
  5. Again, all 400 events are popped from events and dumped into eventsBatch.
  6. The worker's ready flag is set again, but the worker cannot start yet (it is still finishing the first inserts).
  7. While the events buffer keeps filling, the worker finishes the first inserts.
  8. The worker acknowledges the ready flag and starts executing the second inserts.
             400    400
Parent |------|------|----------------------------
Worker |------|-------|---------------------------
            start   start
  9. With each iteration, the events buffer is a bit fuller at the time the worker starts working.
             400    400    400    400    400
Parent |------|------|------|------|------|------
Worker |------|-------|-------|-------|-------|--
            start   start   start   start   start
  10. So the events buffer keeps growing until the process's memory consumption gets too big and the system kills it. There is one more detail needed to fully understand this, but it's too tricky to explain in text; please ping me if you want a walkthrough over Hangouts. From /var/log/kern.log:
Apr 12 15:52:32 eventlog1001 kernel: [3875744.298256] Killed process 10787 (eventlogging-co) total-vm:10132688kB, anon-rss:9821896kB, file-rss:524kB
Apr 13 17:49:29 eventlog1001 kernel: [3969232.768147] Killed process 32588 (eventlogging-co) total-vm:9767080kB, anon-rss:9456880kB, file-rss:920kB

The timestamps of these log entries match the ends of the data gaps. When the process is killed, the contents of the buffer are lost. The consumer is then restarted (probably by upstart) and continues inserting the most recent events.

Note that I placed the last_timestamp log unfortunately: it logs not the last event sent to the worker, but the last event read by the consumer. This prevented us from catching the issue earlier.
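A hedged sketch of the logging fix implied here: record the timestamp of the newest event actually moved into the batch, rather than the last event read from the processor. The `pop_batch` helper and its names are hypothetical illustrations, not the actual patch:

```python
import collections

def pop_batch(events, batch_size):
    """Move batch_size events from the shared buffer into a batch,
    returning the batch and the timestamp of its newest event."""
    batch = collections.deque()
    for _ in range(batch_size):
        batch.append(events.popleft())
    # batch[-1] is the newest event handed to the worker, so its
    # timestamp is the right one to log when debugging data gaps.
    last_timestamp = batch[-1].get('timestamp')
    return batch, last_timestamp

events = collections.deque({'timestamp': t} for t in range(5))
batch, ts = pop_batch(events, 3)
print(ts)           # → 2 (newest event in the batch, not the last one read)
print(len(events))  # → 2 (events still waiting in the buffer)
```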

kevinator renamed this task from Troubleshoot EventLogging missing chunks of data to Troubleshoot EventLogging missing chunks of data {oryx}.Apr 17 2015, 6:51 PM
kevinator triaged this task as High priority.
kevinator set Security to None.

Such a good report!

Change 205253 had a related patch set uploaded (by Mforns):
Change consumer logs to better report batch inserts

https://gerrit.wikimedia.org/r/205253

Change 205759 had a related patch set uploaded (by Mforns):
[WIP] Insert events by scid

https://gerrit.wikimedia.org/r/205759

mforns closed this task as Resolved.Apr 22 2015, 3:43 PM

backfilling task for missing data: T96874

Change 205253 merged by Milimetric:
Change consumer logs to better report batch inserts

https://gerrit.wikimedia.org/r/205253

Change 205759 merged by Milimetric:
Insert events by scid

https://gerrit.wikimedia.org/r/205759