
Troubleshoot EventLogging missing chunks of data {oryx}
Closed, ResolvedPublic

Description

We've observed that the EL database (m4-master.eqiad.wmnet) has some "holes": time spans that contain no data in any event table. For instance:

2015-04-09 17:20 -> 18:35
2015-04-11 03:35 -> 05:20
2015-04-11 14:00 -> 15:20
2015-04-11 19:00 -> 20:00
2015-04-12 14:30 -> 15:40
2015-04-13 11:35 -> 12:30
2015-04-13 16:30 -> 17:35

We know that:

  • The EL processor is correctly validating the missing events and writing the corresponding logs.
  • The EL mysql consumer is steadily writing events during the time of the holes. But we cannot tell whether those events are the missing ones, because the consumer logs record only the insertion timestamp, not the event timestamp.
  • The database does not receive any inserts for the missing time spans.
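Gaps like the ones listed above can be found programmatically. A minimal sketch, assuming the event timestamps have already been pulled from the database into a sorted list (`find_gaps` and the threshold value are hypothetical illustrations, not part of EventLogging):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive events are further
    apart than `threshold`, i.e. spans with no data."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps

# Toy data modelled on the first hole (2015-04-09 17:20 -> 18:35).
ts = [datetime(2015, 4, 9, 17, 0),
      datetime(2015, 4, 9, 17, 20),
      datetime(2015, 4, 9, 18, 35),   # 75-minute hole before this event
      datetime(2015, 4, 9, 18, 40)]
print(find_gaps(ts))
```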

Details

Related Gerrit Patches:
mediawiki/extensions/EventLogging (master): Insert events by scid
mediawiki/extensions/EventLogging (master): Change consumer logs to better report batch inserts
mediawiki/extensions/EventLogging (master): Add last event timestamp to mysql consumer log

Event Timeline

mforns created this task. Apr 14 2015, 8:55 PM
mforns claimed this task.
mforns raised the priority of this task to Needs Triage.
mforns updated the task description.
mforns added subscribers: mforns, Nuria, Milimetric, JAllemandou.
Restricted Application added a subscriber: Aklapper. Apr 14 2015, 8:55 PM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Apr 14 2015, 8:55 PM

Change 204261 had a related patch set uploaded (by Mforns):
Add last event timestamp to mysql consumer log

https://gerrit.wikimedia.org/r/204261

Change 204261 merged by Milimetric:
Add last event timestamp to mysql consumer log

https://gerrit.wikimedia.org/r/204261

We found a very likely explanation for the "data gap" issues.
The relevant code is in server/eventlogging/handlers.py:

events = collections.deque()
eventsBatch = collections.deque()  # Batch handed off to the worker thread.
worker = PeriodicThread(...)  # Child thread that executes the inserts.
worker.start()

...

batchSize = 400
while worker.is_alive():
    event = (yield)  # read the event coming from the processor
    events.append(event)
    if len(events) >= batchSize and not worker.ready.isSet():
        logger.debug('Queue is large, sending to child thread')
        last_timestamp = str(event.get('timestamp', None))
        logger.debug('Last event timestamp %s', last_timestamp)
        for i in range(0, batchSize):
            eventsBatch.append(events.popleft())
        worker.ready.set()

Let's look at the following flow:

  1. The events buffer fills until it contains 400 events.
  2. At that point, all 400 events are popped from events and dumped into eventsBatch.
  3. The worker's ready flag is set, so the worker will execute the inserts.
             400
Parent |------|-----------------------------------
Worker |------|-----------------------------------
            start
  4. While the worker is working, the events buffer reaches 400 events again.
  5. Again, all 400 events are popped from events and dumped into eventsBatch.
  6. The worker's ready flag is set again, but the worker cannot start yet (it is still finishing the first inserts).
  7. While the events buffer keeps filling, the worker finishes the first inserts.
  8. The worker acknowledges the ready flag and starts executing the second inserts.
             400    400
Parent |------|------|----------------------------
Worker |------|-------|---------------------------
            start   start
  9. With each iteration, the events buffer is a bit fuller at the time the worker starts working.
             400    400    400    400    400
Parent |------|------|------|------|------|------
Worker |------|-------|-------|-------|-------|--
            start   start   start   start   start
  10. So the events buffer keeps growing until the process's memory consumption gets too big and the system kills it. There is one more detail needed to fully understand this, but it's too tricky to explain in text; please ping me if you want a walkthrough over Hangouts. From /var/log/kern.log:
Apr 12 15:52:32 eventlog1001 kernel: [3875744.298256] Killed process 10787 (eventlogging-co) total-vm:10132688kB, anon-rss:9821896kB, file-rss:524kB
Apr 13 17:49:29 eventlog1001 kernel: [3969232.768147] Killed process 32588 (eventlogging-co) total-vm:9767080kB, anon-rss:9456880kB, file-rss:920kB

The timestamps of these log entries match the ends of the data gaps. When the process is killed, the contents of the buffer are lost. The consumer is then restarted (probably by upstart) and continues inserting the most recent events.

Note that I placed the last_timestamp log unfortunately: it logs not the last event sent to the worker, but the last event read by the consumer. This prevented us from catching the issue earlier.
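A hedged sketch of the logging fix implied here: record the timestamp of the newest event actually moved into the batch, rather than the last event read from the processor. The `pop_batch` helper and its names are hypothetical illustrations, not the actual patch:

```python
import collections

def pop_batch(events, batch_size):
    """Move batch_size events from the shared buffer into a batch,
    returning the batch and the timestamp of its newest event."""
    batch = collections.deque()
    for _ in range(batch_size):
        batch.append(events.popleft())
    # batch[-1] is the newest event handed to the worker, so its
    # timestamp is the right one to log when debugging data gaps.
    last_timestamp = batch[-1].get('timestamp')
    return batch, last_timestamp

events = collections.deque({'timestamp': t} for t in range(5))
batch, ts = pop_batch(events, 3)
print(ts)           # → 2 (newest event in the batch, not the last one read)
print(len(events))  # → 2 (events still waiting in the buffer)
```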

kevinator renamed this task from Troubleshoot EventLogging missing chunks of data to Troubleshoot EventLogging missing chunks of data {oryx}.Apr 17 2015, 6:51 PM
kevinator triaged this task as High priority.
kevinator set Security to None.

Such a good report!

Change 205253 had a related patch set uploaded (by Mforns):
Change consumer logs to better report batch inserts

https://gerrit.wikimedia.org/r/205253

Change 205759 had a related patch set uploaded (by Mforns):
[WIP] Insert events by scid

https://gerrit.wikimedia.org/r/205759

mforns closed this task as Resolved.Apr 22 2015, 3:43 PM

backfilling task for missing data: T96874

Change 205253 merged by Milimetric:
Change consumer logs to better report batch inserts

https://gerrit.wikimedia.org/r/205253

Change 205759 merged by Milimetric:
Insert events by scid

https://gerrit.wikimedia.org/r/205759