Page MenuHomePhabricator

EventLogging broken in beta
Closed, ResolvedPublic5 Story Points

Description

EventLogging data collection in the beta cluster seems to have stopped a few days ago, on 25 January 2018:

# last timestamp in EL database
tgr@deployment-eventlog02:~$ mysql -sNh 127.0.0.1 --user=eventlogging --password=68QrOq220717816UycU1 log -e 'show tables' | ack-grep '^[A-Z]' | xargs -I '$' mysql -sNh 127.0.0.1 --user=eventlogging --password=68QrOq220717816UycU1 log -e 'select max(timestamp) from $' | sort | tail -n1
20180125111919

tgr@deployment-eventlog02:~$ ls -l /srv/log/eventlogging
total 4
-rw-r--r-- 1 eventlogging eventlogging    0 Jan 26 06:47 all-events.log
drwxrwxr-x 2 eventlogging eventlogging 4096 Jan 26 06:47 archive
-rw-r--r-- 1 eventlogging eventlogging    0 Jan 26 06:47 client-side-events.log

Event Timeline

Tgr created this task.Jan 29 2018, 10:08 PM
Restricted Application added a project: Analytics. · View Herald TranscriptJan 29 2018, 10:08 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Dbrant added a subscriber: Dbrant.Jan 29 2018, 10:57 PM
elukey added a subscriber: elukey.Jan 30 2018, 12:30 AM

I think that the kafka-jumbo hosts have their disk filled up, so el is not getting any new events :(

Tbayer added a subscriber: Tbayer.EditedJan 30 2018, 3:17 AM

The problem seems to be more general: it appears that all the EventLogging tables in Hive (in the new "event" database) are likewise lagging behind, with their last update dating to January 26, e.g. [1]. On the other hand, things look up to date in MySQL.[2]

[1]
jdbc:hive2://analytics1003.eqiad.wmnet:100> SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date, COUNT(*) AS events FROM event.navigationtiming WHERE year =2018 GROUP BY year, month, day ORDER BY year, month, day LIMIT 10000;

year	month	day	date	events
2018	1	1	2018-01-01	351864
2018	1	2	2018-01-02	377098
2018	1	3	2018-01-03	374133
2018	1	4	2018-01-04	371390
2018	1	5	2018-01-05	360239
2018	1	6	2018-01-06	367106
2018	1	7	2018-01-07	401630
2018	1	8	2018-01-08	401073
2018	1	9	2018-01-09	383230
2018	1	10	2018-01-10	378587
2018	1	11	2018-01-11	373086
2018	1	12	2018-01-12	361172
2018	1	13	2018-01-13	358843
2018	1	14	2018-01-14	395842
2018	1	15	2018-01-15	396190
2018	1	16	2018-01-16	389693
2018	1	17	2018-01-17	386363
2018	1	18	2018-01-18	379750
2018	1	19	2018-01-19	362491
2018	1	20	2018-01-20	357370
2018	1	21	2018-01-21	401910
2018	1	22	2018-01-22	403286
2018	1	23	2018-01-23	393751
2018	1	24	2018-01-24	385242
2018	1	25	2018-01-25	380481
2018	1	26	2018-01-26	50555
26 rows selected (52.35 seconds)
[2]
mysql:research@db1108.eqiad.wmnet [(none)]> SELECT TABLE_NAME, UPDATE_TIME FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'log' ORDER BY UPDATE_TIME DESC LIMIT 20;
+------------------------------------------+---------------------+
| TABLE_NAME                               | UPDATE_TIME         |
+------------------------------------------+---------------------+
| MultimediaViewerAttribution_9758179      | 2018-01-30 03:15:51 |
| mediawiki_page_delete_2                  | 2018-01-30 03:15:51 |
| RecentChangesTopLinks_16732249           | 2018-01-30 03:15:51 |
| UploadWizardStep_11772724                | 2018-01-30 03:15:51 |
| MobileWikiAppFeedConfigure_17490595      | 2018-01-30 03:15:51 |
| GuidedTourButtonClick_13869649           | 2018-01-30 03:15:51 |
| ServerSideAccountCreation_5487345        | 2018-01-30 03:15:51 |
| Edit_17541122                            | 2018-01-30 03:15:51 |
| MobileWikiAppShareAFact_12588711         | 2018-01-30 03:15:51 |
| MobileWikiAppLinkPreview_14095177        | 2018-01-30 03:15:51 |
| Print_17630514                           | 2018-01-30 03:15:51 |
| GettingStartedRedirectImpression_7355552 | 2018-01-30 03:15:51 |
| UploadWizardUploadFlowEvent_11772717     | 2018-01-30 03:15:51 |
| MobileWikiAppReadingLists_15520526       | 2018-01-30 03:14:51 |
| Kartographer_16132745                    | 2018-01-30 03:14:51 |
| MobileWikiAppShareAFact_11331974         | 2018-01-30 03:14:51 |
| MobileWikiAppFindInPage_14586774         | 2018-01-30 03:14:51 |
| MobileWikiAppOnThisDay_17490767          | 2018-01-30 03:14:51 |
| MobileWikiAppFeed_16432467               | 2018-01-30 03:14:51 |
| PrefUpdate_5563398                       | 2018-01-30 03:14:51 |
+------------------------------------------+---------------------+
20 rows in set (0.02 sec)

So Andrew and I tried to check why the current kafka topic retention policy is not applied:

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes. Functions independently of log.retention.hours.
log.retention.bytes=134217728

The first attempt is https://gerrit.wikimedia.org/r/#/c/406781, together with the following:

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=134217728

We suspect that since kafka was set to create 1G segments (default setting) it would not take any action with smaller sizes. We lowered the segment's "rotation" threshold to 134M.

In order to free space I had to manually truncate the biggest segment (1G) in /srv/kafka/data/eventlogging_MobileWikiAppSessions-0. Now Eventlogging seems registering new data..

Let's see if this is enough to correctly force Kafka to respect data retention policies.

The problem seems to be more general: it appears that all the EventLogging tables in Hive (in the new "event" database) are likewise lagging behind, with their last update dating to January 26, e.g. [1]. On the other hand, things look up to date in MySQL.[2]

Aren't [1] and [2] related to Prod? I didn't get the connection sorry :(

The problem seems to be more general: it appears that all the EventLogging tables in Hive (in the new "event" database) are likewise lagging behind, with their last update dating to January 26, e.g. [1]. On the other hand, things look up to date in MySQL.[2]

Aren't [1] and [2] related to Prod? I didn't get the connection sorry :(

Yes. By "more general", I meant "also occurring outside the beta cluster".

I don't think the two things are related, but I'll triple check with Andrew tomorrow.

Thanks @elukey - yes, this was just a guess, based on the fact that the Hive tables stopped updating on the same day (January 26). To be precise, the latest timestamp in event.navigationtiming is 2018-01-26T03:59:59 right now. @Ottomata , should we open a separate bug for this?

I truncated all the eventlogging topic partitions on the deployment-kafka-jumbo hosts and restarted the brokers, let's see if the logs will fill the disk again or not.

Tgr added a comment.Jan 30 2018, 5:19 PM

Thanks! Event logging in beta seems to work fine now.

elukey changed the task status from Open to Stalled.Jan 30 2018, 11:28 PM

Thanks @elukey - yes, this was just a guess, based on the fact that the Hive tables stopped updating on the same day (January 26). To be precise, the latest timestamp in event.navigationtiming is 2018-01-26T03:59:59 right now. @Ottomata , should we open a separate bug for this?

OK, now split off into T186130: Hive EventLogging tables not updating since January 26 - feel free to close this one, based on @Tgr's input.

Nuria added a subscriber: Nuria.Jan 31 2018, 6:00 PM

@Tbayer this ticket is about our testing environment, there is no hadoop there. The kafka work @elukey is doimg is on labs.

Will keep this open for a couple of days to check the disk consumption after the last change.

Tgr added a comment.Jan 31 2018, 8:40 PM

EventLogging records now end up in /srv/log/eventlogging/all-events.log but the DB tables seem broken again.

Nuria added a comment.Jan 31 2018, 9:50 PM

I reawoke mysql in beta, stop/start really. Must have been left in a bad state. resolving but reopen if you find other issues. I did verify events are inserted on db.

Nuria closed this task as Resolved.Jan 31 2018, 9:51 PM
Nuria reopened this task as Stalled.

Sorry, keeping open per @elukey's comment

Tgr added a comment.Jan 31 2018, 10:55 PM

MySQL seems to be running (I can connect and run queries) but new events still do not appear in the tables. (I'm testing with InputDeviceDynamics_17661826 which should log data every time someone submits the registration page on desktop.)

Nuria added a comment.EditedFeb 1 2018, 12:04 AM

I see events being inserted so maybe is worth looking at your schema?

MariaDB [log]> select timestamp from MobileWikiAppShareAFact_12588711 order by timestamp desc limit 10;
+----------------+

timestamp

+----------------+

20180201000005
20180131235835
20180131235706
20180131235706
20180131235553
20180131235342
20180131235341
20180131235257
20180131235110
20180131235110

+----------------+

Nuria added a comment.Feb 1 2018, 12:13 AM

InputDeviceDynamics_17661826 has couple records from today, but it does not look there were many events sent from all-events.log

Tgr added a comment.Feb 1 2018, 12:19 AM
tgr@deployment-eventlog02:~$ date
Thu Feb  1 00:18:32 UTC 2018

tgr@deployment-eventlog02:~$ ack-grep InputDeviceDynamics /srv/log/eventlogging/all-events.log | jq '"\(.dt),\(.event.userName)"'
"2018-01-31T19:47:49,Foo bar 4"
"2018-01-31T22:03:56,Foo bar 5"
"2018-01-31T23:43:09,Foo bar 6"
"2018-01-31T23:43:53,Foo bar 7"
"2018-02-01T00:13:06,Foo bar 10"

tgr@deployment-eventlog02:~$ mysql -h 127.0.0.1 --user=eventlogging --password=68QrOq220717816UycU1 log -e 'select event_username, timestamp from InputDeviceDynamics_17661826;'
+----------------+----------------+
| event_username | timestamp      |
+----------------+----------------+
| Foo Bar 2      | 20180130170609 |
| Foo bar 3      | 20180130171801 |
| Foo bar 5      | 20180131220356 |
| Foo bar 6      | 20180131234309 |
+----------------+----------------+

So yes, it is working somewhat, but some of the events seem to get lost (delayed?).

Nuria added a comment.Feb 1 2018, 12:31 AM

So yes, it is working somewhat, but some of the events seem to get lost (delayed?).

They are delayed yes (always, EL insertion in Mysql buffers) so it might take couple minutes to see them appear, but they should not get lost (unless process is dying).

There is an error of process dying on the mysqlog that is not strange on testing environment, the connection to mysql fails, but that is the only one i can see:

2018-01-31 21:37:46,332 [27961] (MainThread) Exception caught Traceback (most recent call last):

Exceptions are on this log:

root@deployment-eventlog02:/var/log/upstart# more eventlogging_consumer-mysql-m4-master-00.log

So I would expect event Foo bar 10 shoudl be on DB. This mysql uses tokuDb, could it be that table was created with wrong engine (innoDb) and insertions are faulty?

Tgr added a comment.Feb 1 2018, 12:39 AM

Seems like all the beta eventlogging tables use InnoDB.

Tgr added a comment.Feb 1 2018, 12:44 AM
MariaDB [log]> show global variables like 'default_storage_engine';
+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| default_storage_engine | InnoDB |
+------------------------+--------+

Does that come from puppet / some other git repo, or can I just change it by hand?

Nuria added a comment.Feb 1 2018, 4:07 PM

I am going to restart eventlogging from master (it was on a different changeset than what we have now inprod). Let's take a look again today.

Nuria claimed this task.Feb 2 2018, 5:09 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria set the point value for this task to 5.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.
elukey moved this task from Backlog to Done on the User-Elukey board.Feb 5 2018, 5:37 PM
Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Feb 9 2018, 5:13 PM
Krinkle updated the task description. (Show Details)Feb 21 2018, 8:45 AM

As FYI we migrated deployment-eventlog02 to deployment-eventlog05 :)

Nuria closed this task as Resolved.Mar 2 2018, 9:33 PM