
Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana
Closed, Resolved · Public

Description

The log.Popups_16364296 EventLogging table suddenly stopped recording events around 6:37pm UTC on August 30, but events are still being sent according to Grafana.

There was no schema revision change around that time, and in any case previous revision changes did not result in such a sharp drop to exactly zero. It is also hard to see how it could be related to the sampling rate increase earlier that day: the table recorded events fine at the increased rate for several hours before it stopped, and according to Grafana the event rate was already declining from its daily peak around that time.

SELECT DATE(timestamp) AS date, HOUR(timestamp) AS hour, COUNT(*) AS events
FROM log.Popups_16364296
WHERE DATE(timestamp) >= '2017-08-29'
GROUP BY date, hour
ORDER BY date, hour;

+------------+------+--------+
| date       | hour | events |
+------------+------+--------+
| 2017-08-29 |    0 |   7125 |
| 2017-08-29 |    1 |   8380 |
| 2017-08-29 |    2 |   8548 |
| 2017-08-29 |    3 |   8753 |
| 2017-08-29 |    4 |   8976 |
| 2017-08-29 |    5 |   9951 |
| 2017-08-29 |    6 |  14996 |
| 2017-08-29 |    7 |  17978 |
| 2017-08-29 |    8 |  23144 |
| 2017-08-29 |    9 |  26012 |
| 2017-08-29 |   10 |  28651 |
| 2017-08-29 |   11 |  31647 |
| 2017-08-29 |   12 |  36948 |
| 2017-08-29 |   13 |  40790 |
| 2017-08-29 |   14 |  43676 |
| 2017-08-29 |   15 |  45534 |
| 2017-08-29 |   16 |  46176 |
| 2017-08-29 |   17 |  47115 |
| 2017-08-29 |   18 |  50516 |
| 2017-08-29 |   19 |  50092 |
| 2017-08-29 |   20 |  47156 |
| 2017-08-29 |   21 |  39199 |
| 2017-08-29 |   22 |  30009 |
| 2017-08-29 |   23 |  27352 |
| 2017-08-30 |    0 |  28923 |
| 2017-08-30 |    1 |  28045 |
| 2017-08-30 |    2 |  27460 |
| 2017-08-30 |    3 |  26989 |
| 2017-08-30 |    4 |  28453 |
| 2017-08-30 |    5 |  29657 |
| 2017-08-30 |    6 |  39342 |
| 2017-08-30 |    7 |  44692 |
| 2017-08-30 |    8 |  50698 |
| 2017-08-30 |    9 |  56383 |
| 2017-08-30 |   10 |  56774 |
| 2017-08-30 |   11 |  65572 |
| 2017-08-30 |   12 |  71129 |
| 2017-08-30 |   13 |  93465 |
| 2017-08-30 |   14 | 742229 |
| 2017-08-30 |   15 | 721424 |
| 2017-08-30 |   16 | 668968 |
| 2017-08-30 |   17 | 643507 |
| 2017-08-30 |   18 | 392597 |
+------------+------+--------+
43 rows in set (12.81 sec)
SELECT DATE(timestamp) AS date, HOUR(timestamp) AS hour, MINUTE(timestamp) AS minute,
COUNT(*) AS events
FROM log.Popups_16364296
WHERE timestamp LIKE '2017083018%'
GROUP BY date, hour, minute
ORDER BY date, hour, minute;

+------------+------+--------+--------+
| date       | hour | minute | events |
+------------+------+--------+--------+
| 2017-08-30 |   18 |      0 |  10884 |
| 2017-08-30 |   18 |      1 |  10605 |
| 2017-08-30 |   18 |      2 |  10303 |
| 2017-08-30 |   18 |      3 |  10073 |
| 2017-08-30 |   18 |      4 |  10362 |
| 2017-08-30 |   18 |      5 |  10492 |
| 2017-08-30 |   18 |      6 |  10082 |
| 2017-08-30 |   18 |      7 |  10222 |
| 2017-08-30 |   18 |      8 |  10383 |
| 2017-08-30 |   18 |      9 |  10919 |
| 2017-08-30 |   18 |     10 |  10476 |
| 2017-08-30 |   18 |     11 |  10984 |
| 2017-08-30 |   18 |     12 |  10768 |
| 2017-08-30 |   18 |     13 |  11266 |
| 2017-08-30 |   18 |     14 |  10756 |
| 2017-08-30 |   18 |     15 |  10831 |
| 2017-08-30 |   18 |     16 |  10860 |
| 2017-08-30 |   18 |     17 |  10762 |
| 2017-08-30 |   18 |     18 |  10798 |
| 2017-08-30 |   18 |     19 |  10985 |
| 2017-08-30 |   18 |     20 |  11085 |
| 2017-08-30 |   18 |     21 |  10873 |
| 2017-08-30 |   18 |     22 |  11180 |
| 2017-08-30 |   18 |     23 |  10509 |
| 2017-08-30 |   18 |     24 |   9992 |
| 2017-08-30 |   18 |     25 |  10490 |
| 2017-08-30 |   18 |     26 |  10542 |
| 2017-08-30 |   18 |     27 |  10694 |
| 2017-08-30 |   18 |     28 |  10684 |
| 2017-08-30 |   18 |     29 |  11034 |
| 2017-08-30 |   18 |     30 |  11000 |
| 2017-08-30 |   18 |     31 |  11082 |
| 2017-08-30 |   18 |     32 |  10510 |
| 2017-08-30 |   18 |     33 |  10855 |
| 2017-08-30 |   18 |     34 |  11173 |
| 2017-08-30 |   18 |     35 |  10790 |
| 2017-08-30 |   18 |     36 |  10743 |
| 2017-08-30 |   18 |     37 |   6540 |
+------------+------+--------+--------+
38 rows in set (0.51 sec)

Details

Event Timeline

Restricted Application added a project: Analytics. · View Herald Transcript · Sep 1 2017, 6:38 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

(Not tagging this with Readers-Web-Backlog for now because it doesn't seem to be an issue with the instrumentation at this point, but perhaps it's worth putting it into the "Tracking" column there.)

Investigated that tonight:

  • No errors on Grafana or in the logs
  • Data is inserted on the master correctly (the exact same query @HaeB pasted above gives up-to-date results when run on the master).

Looks like we still have replication problems @Ottomata :(

Tbayer added a comment. Edited Sep 1 2017, 7:37 PM

Thanks! BTW, the above queries were run on analytics-store; the copy on s1-analytics-slave replicated for 21 more minutes:

mysql:research@s1-analytics-slave.eqiad.wmnet [(none)]> SELECT DATE(timestamp) AS date, HOUR(timestamp) AS hour, MINUTE(timestamp) AS minute,
COUNT(*) AS events
FROM log.Popups_16364296
WHERE timestamp LIKE '2017083018%' OR timestamp LIKE '2017083019%'
GROUP BY date, hour, minute
ORDER BY date, hour, minute;

+------------+------+--------+--------+
| date       | hour | minute | events |
+------------+------+--------+--------+
| 2017-08-30 |   18 |      0 |  10884 |
| 2017-08-30 |   18 |      1 |  10605 |
| 2017-08-30 |   18 |      2 |  10303 |
| 2017-08-30 |   18 |      3 |  10073 |
| 2017-08-30 |   18 |      4 |  10362 |
| 2017-08-30 |   18 |      5 |  10492 |
| 2017-08-30 |   18 |      6 |  10082 |
| 2017-08-30 |   18 |      7 |  10222 |
| 2017-08-30 |   18 |      8 |  10383 |
| 2017-08-30 |   18 |      9 |  10919 |
| 2017-08-30 |   18 |     10 |  10476 |
| 2017-08-30 |   18 |     11 |  10984 |
| 2017-08-30 |   18 |     12 |  10768 |
| 2017-08-30 |   18 |     13 |  11266 |
| 2017-08-30 |   18 |     14 |  10756 |
| 2017-08-30 |   18 |     15 |  10831 |
| 2017-08-30 |   18 |     16 |  10860 |
| 2017-08-30 |   18 |     17 |  10762 |
| 2017-08-30 |   18 |     18 |  10798 |
| 2017-08-30 |   18 |     19 |  10985 |
| 2017-08-30 |   18 |     20 |  11085 |
| 2017-08-30 |   18 |     21 |  10873 |
| 2017-08-30 |   18 |     22 |  11180 |
| 2017-08-30 |   18 |     23 |  10509 |
| 2017-08-30 |   18 |     24 |   9992 |
| 2017-08-30 |   18 |     25 |  10490 |
| 2017-08-30 |   18 |     26 |  10542 |
| 2017-08-30 |   18 |     27 |  10694 |
| 2017-08-30 |   18 |     28 |  10684 |
| 2017-08-30 |   18 |     29 |  11034 |
| 2017-08-30 |   18 |     30 |  11000 |
| 2017-08-30 |   18 |     31 |  11082 |
| 2017-08-30 |   18 |     32 |  10510 |
| 2017-08-30 |   18 |     33 |  10855 |
| 2017-08-30 |   18 |     34 |  11173 |
| 2017-08-30 |   18 |     35 |  10790 |
| 2017-08-30 |   18 |     36 |  10743 |
| 2017-08-30 |   18 |     37 |  11082 |
| 2017-08-30 |   18 |     38 |  10676 |
| 2017-08-30 |   18 |     39 |  10629 |
| 2017-08-30 |   18 |     40 |  10715 |
| 2017-08-30 |   18 |     41 |  10373 |
| 2017-08-30 |   18 |     42 |  11161 |
| 2017-08-30 |   18 |     43 |  10712 |
| 2017-08-30 |   18 |     44 |  11085 |
| 2017-08-30 |   18 |     45 |  10539 |
| 2017-08-30 |   18 |     46 |  10815 |
| 2017-08-30 |   18 |     47 |  11280 |
| 2017-08-30 |   18 |     48 |  11299 |
| 2017-08-30 |   18 |     49 |  11364 |
| 2017-08-30 |   18 |     50 |  11408 |
| 2017-08-30 |   18 |     51 |  11332 |
| 2017-08-30 |   18 |     52 |  11472 |
| 2017-08-30 |   18 |     53 |  11261 |
| 2017-08-30 |   18 |     54 |  11820 |
| 2017-08-30 |   18 |     55 |  11052 |
| 2017-08-30 |   18 |     56 |  11085 |
| 2017-08-30 |   18 |     57 |  10765 |
| 2017-08-30 |   18 |     58 |   3574 |
+------------+------+--------+--------+
59 rows in set (2.07 sec)
Tbayer removed a subscriber: HaeB. Sep 1 2017, 7:38 PM
elukey added a subscriber: elukey. Sep 4 2017, 1:42 PM

On db1047 (analytics-slave) I've run the query from the description and found:

[..]
| 2017-08-30 |    0 |  28923 |
| 2017-08-30 |    1 |  28045 |
| 2017-08-30 |    2 |  27460 |
| 2017-08-30 |    3 |  26989 |
| 2017-08-30 |    4 |  28453 |
| 2017-08-30 |    5 |  29657 |
| 2017-08-30 |    6 |  39342 |
| 2017-08-30 |    7 |  44692 |
| 2017-08-30 |    8 |  50698 |
| 2017-08-30 |    9 |  56383 |
| 2017-08-30 |   10 |  56774 |
| 2017-08-30 |   11 |  65572 |
| 2017-08-30 |   12 |  71129 |
| 2017-08-30 |   13 |  93465 |
| 2017-08-30 |   14 | 742229 |
| 2017-08-30 |   15 | 721424 |
| 2017-08-30 |   16 | 668968 |
| 2017-08-30 |   17 | 643507 |
| 2017-08-30 |   18 | 649306 |
| 2017-08-30 |   19 | 662889 |
| 2017-08-30 |   20 | 616401 |
| 2017-08-30 |   21 | 507900 |
| 2017-08-30 |   22 | 394421 |
| 2017-08-30 |   23 | 349606 |
| 2017-08-31 |    0 | 333121 |
| 2017-08-31 |    1 | 343687 |
| 2017-08-31 |    2 | 342440 |
| 2017-08-31 |    3 | 322379 |
| 2017-08-31 |    4 | 262239 |
+------------+------+--------+

Data up to 2017-08-31 04:00 seems consistent with what I am seeing on m4-master; not sure why it takes so much time to propagate the updates.

...

Looks like we still have replication problems @Ottomata :(

For reference, I think Joseph was referring to the fact that earlier, similar issues (such as T155639#3068167) were meant to have been addressed with T124307.

Replication is now at about 10am on August 31:

SELECT DATE(timestamp) AS date, HOUR(timestamp) AS hour, COUNT(*) AS events
FROM log.Popups_16364296
WHERE DATE(timestamp) >= '2017-08-29'
GROUP BY date, hour
ORDER BY date, hour;

+------------+------+--------+
| date       | hour | events |
+------------+------+--------+
| 2017-08-29 |    0 |   7125 |
| 2017-08-29 |    1 |   8380 |
| 2017-08-29 |    2 |   8548 |
| 2017-08-29 |    3 |   8753 |
| 2017-08-29 |    4 |   8976 |
| 2017-08-29 |    5 |   9951 |
| 2017-08-29 |    6 |  14996 |
| 2017-08-29 |    7 |  17978 |
| 2017-08-29 |    8 |  23144 |
| 2017-08-29 |    9 |  26012 |
| 2017-08-29 |   10 |  28651 |
| 2017-08-29 |   11 |  31647 |
| 2017-08-29 |   12 |  36948 |
| 2017-08-29 |   13 |  40790 |
| 2017-08-29 |   14 |  43676 |
| 2017-08-29 |   15 |  45534 |
| 2017-08-29 |   16 |  46176 |
| 2017-08-29 |   17 |  47115 |
| 2017-08-29 |   18 |  50516 |
| 2017-08-29 |   19 |  50092 |
| 2017-08-29 |   20 |  47156 |
| 2017-08-29 |   21 |  39199 |
| 2017-08-29 |   22 |  30009 |
| 2017-08-29 |   23 |  27352 |
| 2017-08-30 |    0 |  28923 |
| 2017-08-30 |    1 |  28045 |
| 2017-08-30 |    2 |  27460 |
| 2017-08-30 |    3 |  26989 |
| 2017-08-30 |    4 |  28453 |
| 2017-08-30 |    5 |  29657 |
| 2017-08-30 |    6 |  39342 |
| 2017-08-30 |    7 |  44692 |
| 2017-08-30 |    8 |  50698 |
| 2017-08-30 |    9 |  56383 |
| 2017-08-30 |   10 |  56774 |
| 2017-08-30 |   11 |  65572 |
| 2017-08-30 |   12 |  71129 |
| 2017-08-30 |   13 |  93465 |
| 2017-08-30 |   14 | 742229 |
| 2017-08-30 |   15 | 721424 |
| 2017-08-30 |   16 | 668968 |
| 2017-08-30 |   17 | 643507 |
| 2017-08-30 |   18 | 649306 |
| 2017-08-30 |   19 | 662889 |
| 2017-08-30 |   20 | 616401 |
| 2017-08-30 |   21 | 507900 |
| 2017-08-30 |   22 | 394421 |
| 2017-08-30 |   23 | 349606 |
| 2017-08-31 |    0 | 333121 |
| 2017-08-31 |    1 | 343687 |
| 2017-08-31 |    2 | 342440 |
| 2017-08-31 |    3 | 322379 |
| 2017-08-31 |    4 | 326741 |
| 2017-08-31 |    5 | 367186 |
| 2017-08-31 |    6 | 477614 |
| 2017-08-31 |    7 | 550245 |
| 2017-08-31 |    8 | 621029 |
| 2017-08-31 |    9 | 661480 |
| 2017-08-31 |   10 | 690405 |
| 2017-08-31 |   11 | 104810 |
+------------+------+--------+
60 rows in set (22.09 sec)
elukey added a comment. Edited Sep 5 2017, 9:35 AM

The strange thing is that the following happens over and over:

elukey@dbstore1002:/var/log$ tail -n 1000 eventlogging_sync.log | grep Popups_16364296
2017-09-05T09:31:53 localhost log Popups_16364296_15423246 (no new data on master in last 90 days, skipping)
2017-09-05T09:32:15 localhost log Popups_16364296_15423246 (no new data on master in last 90 days, skipping)
2017-09-05T09:32:17 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:33:26 localhost log Popups_16364296_15423246 (no new data on master in last 90 days, skipping)
2017-09-05T09:33:33 localhost log Popups_16364296 (rows!) ok

This is normal behavior afaics, but a bit weird since the slave is so far behind the master.

In the error log I can see the following repeating, but it is probably only a warning since that table does not have an id column:

mysqldump: Couldn't execute 'SELECT /*!40001 SQL_NO_CACHE */ * FROM `NavigationTiming_13317958` WHERE id >= 20170420192612 ORDER BY id LIMIT 1000': Unknown column 'id' in 'where clause' (1054)
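
(For context: the shape of that query suggests the sync script copies each table in id-ordered batches. A rough sketch of the loop follows; the host variables and the fallback logic are assumptions, not the actual eventlogging_sync.sh.)

# Rough sketch only; duplicate handling, SSL flags, etc. omitted.
for table in $(mysql -h "$MASTER" --skip-column-names <<< "SHOW TABLES FROM log"); do
  # Resume from the highest id already present on the slave.
  last_id=$(mysql -h "$SLAVE" --skip-column-names <<< "SELECT IFNULL(MAX(id), 0) FROM log.$table")
  # Tables without an id column would need a timestamp-based condition instead,
  # which is what the patch below is about.
  mysqldump -h "$MASTER" --no-create-info \
    --where="id >= $last_id ORDER BY id LIMIT 1000" log "$table" \
    | mysql -h "$SLAVE" log
done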

Change 375977 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_sync: fix bug in id/timestamp column handling

https://gerrit.wikimedia.org/r/375977

elukey added a comment. Sep 5 2017, 9:52 AM

The bug fixed by the above patch is not the root cause of the issue (the script loops over all the EventLogging tables regardless), but it is nice to fix in my opinion.

The strange thing is that the following happens over and over:

elukey@dbstore1002:/var/log$ tail -n 1000 eventlogging_sync.log | grep Popups_16364296
2017-09-05T09:31:53 localhost log Popups_16364296_15423246 (no new data on master in last 90 days, skipping)
2017-09-05T09:32:15 localhost log Popups_16364296_15423246 (no new data on master in last 90 days, skipping)
2017-09-05T09:32:17 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:33:26 localhost log Popups_16364296_15423246 (no new data on master in last 90 days, skipping)
2017-09-05T09:33:33 localhost log Popups_16364296 (rows!) ok

This is normal behavior afaics, but a bit weird since the slave is so far behind the master.

Turns out that I am stupid; the following query looks better and is more informative:

tail -n 100000 eventlogging_sync.log | grep Popups_16364296\ 

2017-09-05T09:39:24 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:40:26 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:40:59 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:41:31 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:42:25 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:43:05 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:44:26 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:45:03 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:46:02 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:46:44 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:47:37 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:48:12 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:49:28 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:49:49 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:50:49 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:52:04 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:52:40 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:54:01 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:54:34 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:55:17 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:56:33 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:57:20 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:57:46 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:59:10 localhost log Popups_16364296 (rows!) ok
2017-09-05T09:59:37 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:00:49 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:02:35 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:03:15 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:04:19 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:05:38 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:06:38 localhost log Popups_16364296 (rows!) ok
2017-09-05T10:08:08 localhost log Popups_16364296 (rows!) ok

The eventlogging_sync script is probably lagging behind since the batch size is only 1000...

Change 375977 merged by Elukey:
[operations/puppet@production] eventlogging_sync: fix bug in id/timestamp column handling

https://gerrit.wikimedia.org/r/375977

elukey claimed this task. Sep 5 2017, 10:25 AM
elukey edited projects, added Analytics-Kanban; removed Patch-For-Review, Analytics.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 376201 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mariadb::analytics::custom_repl_slave: raise el_sync batch to 10k

https://gerrit.wikimedia.org/r/376201

Change 376201 merged by Elukey:
[operations/puppet@production] role::mariadb::analytics::custom_repl_slave: raise el_sync batch to 10k

https://gerrit.wikimedia.org/r/376201

elukey added a comment. Sep 6 2017, 9:59 AM

@Tbayer let's see how it goes with a bigger batch size. Data should replicate faster now, but I am pretty sure it will not be near real-time like we are used to for the other low-volume tables. The script does a mysqldump for every table in the log database, so a few minutes pass between each batch:

2017-09-06T09:43:36 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:44:07 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:46:20 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:46:32 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:47:40 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:47:49 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:49:16 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:49:54 localhost log Popups_16364296 (rows!) ok
2017-09-06T09:52:14 localhost log Popups_16364296 (rows!) ok

For Popups_16364296 the time varies between seconds and minutes; this is why the slave is so far behind the master. I wouldn't push the batch size much beyond 10k, to avoid any issue with the current replication performance, but we could think about other custom ways, like periodically doing a manual dump of only Popups_16364296 with eventlogging_sync stopped, to speed up the replication (a rough sketch below). Let me know how urgent it is for you to check the newest data for Popups_16364296 and I'll try to find a solution accordingly :)
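
(Such a one-off manual sync could look roughly like the following; the exact hosts and flags are assumptions, and duplicate/privilege handling is omitted.)

# Hypothetical one-off copy of a single table from the m4 master to the slave:
mysqldump -h db1046.eqiad.wmnet --single-transaction --no-create-info log Popups_16364296 \
  | mysql -h dbstore1002.eqiad.wmnet log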

Some numbers:

elukey@db1046:~$ sudo mysql -h localhost --compress --skip-column-names --skip-ssl <<< "select ifnull(max(id),0) from log.Popups_16364296"
117229498

elukey@dbstore1002:~$ sudo mysql -h localhost --compress --skip-column-names --skip-ssl <<< "select ifnull(max(id),0) from log.Popups_16364296"
33923960

With the new settings the slave would get the same data as the master after about 8330 batches (10k each). Assuming 2 minutes between each batch, we'd end up with 277 hours (~11 days). I strongly believe that we need to manually sync the slave for this use case.
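
(The back-of-the-envelope arithmetic, spelled out:)

rows_behind=$(( 117229498 - 33923960 ))   # 83305538 rows the slave is missing
batches=$(( rows_behind / 10000 ))        # ~8330 batches of 10k rows
minutes=$(( batches * 2 ))                # assuming ~2 minutes per batch
echo "$(( minutes / 60 )) hours"          # ~277 hours, i.e. roughly 11 days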

Thanks @elukey - right now, while the A/B test is still running, it's not too urgent to be able to check the latest data in real-time (although it would be great to get our hands on the Sep 1 data soon, to be able to assess the effect of a bug fix T172291#3572535).

However, such an 11-day delay may become a concern after the end of the A/B test (around September 16).

BTW after this is sorted out, let's not forget to document what we are learning here about EventLogging replication limitations somewhere on Wikitech.

elukey added a comment. Edited Sep 7 2017, 1:40 PM

So I reverted the batch size to 1000 since it was not really getting us any benefit (and it was impacting disk usage, which is already high due to the sanitization script running). I'll sync with Andrew to do a manual sync of the Popups_16364296 schema as soon as possible to unblock your analysis.

Nuria added a subscriber: Nuria. Edited Sep 7 2017, 6:25 PM

@Tbayer We talked about this and the MySQL issues are many; what we can do is make this data available on Hadoop in table form, similar (but not identical) to the couple of tables we migrated recently. Among the issues is the hardware, and as we have mentioned before, for large amounts of data MySQL just does not work.

Nuria added a comment. Sep 7 2017, 10:05 PM

@Tbayer: also please take a look at the rate of intake of events; it is quite a bit higher than 100 per sec on average, more like 150 per sec: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?refresh=5m&orgId=1&var-schema=Popups&from=now-30d&to=now

@Tbayer: also please take a look at the rate of intake of events; it is quite a bit higher than 100 per sec on average, more like 150 per sec: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?refresh=5m&orgId=1&var-schema=Popups&from=now-30d&to=now

Indeed. We were definitely aiming for around 100 instead, but maybe the traffic estimate was off a bit (CCing @phuedx who handled that part, in case he has any insight on this); another possibility might just be that user behavior on these two wikis differs a lot from that on the wikis in the previous experiment, from which the rate was extrapolated. I would really like to avoid changing the rate mid-experiment though, as it would create a lot of additional adjustment work.

Tbayer added a comment. Edited Sep 8 2017, 12:47 AM

@Tbayer We talked about this and the MySQL issues are many; what we can do is make this data available on Hadoop in table form, similar (but not identical) to the couple of tables we migrated recently. Among the issues is the hardware, and as we have mentioned before, for large amounts of data MySQL just does not work.

Thanks for the offer! I would have preferred to do the analysis in MySQL/MariaDB as usual (the query times don't seem too bad so far BTW, and moving to Hive will involve extra work for me in rewriting all the previously used queries). But if these problems are severe, I guess that's the best option at this point, also considering that it worked well in that recent example (T172322#3526095 ). Do note though that we will need all the fields. Also, we should still keep the MySQL table (with the existing purging policy) in case we need to fall back to it.

Regarding these many MySQL issues: As mentioned above, it would be great if we could document these new learnings about MySQL and EventLogging from this incident somewhere. Really appreciate everyone helping to sort them out; overall I do want to mention that when we (the Readers team) reached out about the event rate beforehand, everyone's concerns focused on the current disk space problem instead of potential issues with the intake rate / replication per se.

elukey added a comment. Sep 8 2017, 3:17 PM

Thanks for the offer! I would have preferred to do the analysis in MySQL/MariaDB as usual (the query times don't seem too bad so far BTW, and moving to Hive will involve extra work for me in rewriting all the previously used queries). But if these problems are severe, I guess that's the best option at this point, also considering that it worked well in that recent example (T172322#3526095 ). Do note though that we will need all the fields. Also, we should still keep the MySQL table (with the existing purging policy) in case we need to fall back to it.

We are definitely going to keep the data in MySQL, but as mentioned before it will take a bit of time before the slaves catch up with the master. I completely agree that doing extra work for this analysis is not great, and we are sorry for it. We learned about several limitations of the current EventLogging replication (which, as you know, is kind of special). The Analytics team is of course available for any questions/follow-ups about Hive/HDFS if needed!

Regarding these many MySQL issues: As mentioned above, it would be great if we could document these new learnings about MySQL and EventLogging from this incident somewhere.

I'll take responsibility for updating the EventLogging documentation before closing the task.

Really appreciate everyone helping to sort them out; overall I do want to mention that when we (the Readers team) reached out about the event rate beforehand, everyone's concerns focused on the current disk space problem instead of potential issues with the intake rate / replication per se.

To be clear, we are not blaming anybody for these MySQL problems; we brought up these issues in the task to let everybody know what was happening and what our response (as the Analytics team) was :)

Thanks for the patience!

Nuria added a comment. Edited Sep 8 2017, 7:49 PM

Regarding these many MySQL issues: As mentioned above, it would be great if we could document these new learnings about MySQL and EventLogging from this incident somewhere.

We can document further, but I really do not think these are new learnings; we have frequently highlighted how the MySQL backend of EventLogging is not well suited for high volume. In the future let's not use it for large amounts of data, it is just not going to cut it; replication is only one issue, and there are others like disk space and hardware. Some prior replication issues with a similar inflow of events:
https://phabricator.wikimedia.org/T123634#1934482

You have a table in hive (on my database 'nuria') called popups that should have all data from Sun Aug 27 2017 to September 7th. A sample query:

select useragent, event.popupDelay from popups where year=2017 limit 10;

Note that event is a struct field in Hive, so the structure is different from the MySQL table, where the event fields are flattened out.
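
(A sketch of the difference; both queries assume the schemas shown elsewhere in this task:)

-- MariaDB (flattened event columns):
SELECT event_perceivedWait FROM log.Popups_16364296 LIMIT 10;

-- Hive (event is a struct; the table is partitioned by year/month/day/hour):
SELECT event.perceivedWait FROM nuria.popups WHERE year = 2017 LIMIT 10;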

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Sep 13 2017, 3:14 PM

You have a table in hive (on my database 'nuria') called popups that should have all data from Sun Aug 27 2017 to September 7th. A sample query:
select useragent, event.popupDelay from popups where year=2017 limit 10;
Note that event is a struct field in Hive, so the structure is different from the MySQL table, where the event fields are flattened out.

Thanks again for this! As mentioned elsewhere last week, I already used it for various queries and so far Hive worked for all of them.

Now that the experiment has concluded, could we update the Hive copy with the entire data until Sep 18? Also, if there is a way to copy the full table (log.Popups_16364296, with data back to March 2017), that could make things easier in the future.

Nuria added a comment. Sep 21 2017, 7:12 PM

@Tbayer: we can put all the data in Hadoop and remove it entirely from MySQL, does that sound good? I should be able to refine all the data in Hadoop in one go and put it in my db for you to take a look; once you confirm it is good we can drop the table in MySQL and move the popups table to the archive database. Let us know if this sounds good.

@Tbayer: we can put all the data in Hadoop and remove it entirely from MySQL, does that sound good? I should be able to refine all the data in Hadoop in one go and put it in my db for you to take a look; once you confirm it is good we can drop the table in MySQL and move the popups table to the archive database. Let us know if this sounds good.

That sounds reasonable, yes. Just a heads-up that I will need to find some time to do these checks (ensuring we have all the necessary queries adapted from MariaDB and did not accidentally lose data in transition), which likely won't happen before the end of the month. (CCing @ovasileva FYI)

We are talking about one Popups table, right? (Well, probably two: Popups_16364296 and Popups_16364296_15423246.) Can we do away with the rest of the tables, or do you also need those?

mysql> select max(timestamp) from Popups_16364296;
+----------------+
| max(timestamp) |
+----------------+
| 20170921222554 |
+----------------+
1 row in set (0.00 sec)

mysql> select min(timestamp) from Popups_16364296;
+----------------+
| min(timestamp) |
+----------------+
| 20170329170308 |
+----------------+
1 row in set (0.04 sec)

@Tbayer - are the enwiki and dewiki results saved in a different table from the stage0 results? If so, we may be able to delete the most recent test now and relaunch once T175918: EventLogging subscriber module in ready state but not sending tracked events is complete.

Nuria added a comment. Sep 22 2017, 5:32 PM

Just FYI that the MySQL backend cannot really deal with this volume of data; any subsequent test data will be available through the Hadoop backend.

Nuria moved this task from Paused to In Progress on the Analytics-Kanban board. Sep 28 2017, 7:02 PM
Nuria added a comment. Edited Sep 28 2017, 7:04 PM

@Tbayer: I imported the popups tables from MySQL into the nuria database on Hadoop. Keep in mind that Popups events are still being sent by (probably) cached code; the import only has data covering the duration of your experiment.

see:
popups_16364296
popups_16364296_15423246

Please take a look and let us know if we can move these tables to archive and delete them from the MySQL hosts (cc @elukey).

Nuria added a comment. Oct 2 2017, 5:54 PM

@Tbayer Waiting for your confirmation to delete the Popups tables from the db; the slave's disk is full, so if we could get that as soon as possible it would be great: https://phabricator.wikimedia.org/T168303

@Tbayer Waiting for your confirmation to delete the Popups tables from the db; the slave's disk is full, so if we could get that as soon as possible it would be great: https://phabricator.wikimedia.org/T168303

@Nuria I'm back from vacation and am looking into this now (from the most recent discussion at T168303 I don't know how timely it is at this point, but we should do it anyway).

While doing so, I noticed that the newly imported table nuria.popups_16364296 has a different format from the previously generated one (nuria.popups):

  • The event fields are no longer packed into a struct, but are flattened out (good)
  • The table is no longer partitioned (not so good, but let's see if it's still performant)
  • All event fields are strings now, even those that are numerical in MySQL and were in the previous Hive import too. E.g. perceivedWait should be a bigint. (bad)

Could we fix the latter, to help make sure all MySQL queries can be ported smoothly to Hive? And in general, what's the best place to discuss and debug this new conversion process - should we open a new task? (One would also want to have the useragent field's JSON content accessible as e.g. a map like in the pageview_hourly table, not as a flat string.)

DESCRIBE nuria.popups;

col_name	data_type	comment
event	struct<action:string,hovercardsSuppressedByGadget:boolean,isAnon:boolean,linkInteractionToken:string,namespaceIdHover:bigint,namespaceIdSource:bigint,pageIdSource:bigint,pageTitleHover:string,pageTitleSource:string,pageToken:string,perceivedWait:bigint,popupEnabled:boolean,previewCountBucket:string,previewType:string,sessionToken:string,totalInteractionTime:bigint,editCountBucket:string,api:string,popupDelay:bigint,version:string>	
recvfrom	string	
revision	bigint	
schema	string	
seqid	bigint	
timestamp	bigint	
useragent	string	
uuid	string	
webhost	string	
wiki	string	
year	bigint	
month	bigint	
day	bigint	
hour	bigint	
	NULL	NULL
# Partition Information	NULL	NULL
# col_name            	data_type           	comment             
	NULL	NULL
year	bigint	
month	bigint	
day	bigint	
hour	bigint	
22 rows selected (0.07 seconds)



DESCRIBE nuria.popups_16364296;

col_name	data_type	comment
id	int	
uuid	string	
timestamp	string	
wiki	string	
webhost	string	
useragent	string	
event_action	string	
event_editcountbucket	string	
event_hovercardssuppressedbygadget	string	
event_isanon	string	
event_linkinteractiontoken	string	
event_namespaceidhover	string	
event_namespaceidsource	string	
event_pageidsource	string	
event_pagetitlehover	string	
event_pagetitlesource	string	
event_pagetoken	string	
event_perceivedwait	string	
event_popupdelay	string	
event_popupenabled	string	
event_previewcountbucket	string	
event_previewtype	string	
event_sessiontoken	string	
event_totalinteractiontime	string	
event_version	string	
25 rows selected (0.073 seconds)
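
(On the useragent point above: until the field is exposed as a map, individual keys can presumably be pulled out of the JSON string at query time with Hive's get_json_object; a sketch, with the key name browser_family being an assumption about the JSON contents:)

SELECT get_json_object(useragent, '$.browser_family') AS browser_family,
       COUNT(*) AS events
FROM nuria.popups_16364296
GROUP BY get_json_object(useragent, '$.browser_family')
LIMIT 20;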

Just FYI that the MySQL backend cannot really deal with this volume of data; any subsequent test data will be available through the Hadoop backend.

That should be sufficient in this case (after we're done verifying that all required queries can be ported to Hive, as discussed above). Do you mean that we can record the data directly to Hive, without going through the MySQL (master) table? How would one achieve this when relaunching the test?

Nuria added a comment. Oct 6 2017, 4:49 PM

While doing so, I noticed that the newly imported table nuria.popups_16364296 has a different format from the previously generated one (nuria.popups):

Yes, I imported the data in that table from MySQL; the older data was no longer present on Hadoop, as it is dropped after 90 days (note that I imported two tables from MySQL). Thus the formatting difference. To be clear, this import is just a one-off.

The table is no longer partitioned (not so good, but let's see if it's still performant)

It will be less performant if there are a lot of simultaneous queries, due to the way it is set up schema-wise (it is an Avro table); size-wise this is small for Hadoop, so that should not be an issue.

All event fields are strings now, even those that are numerical in MySQL and were in the previous Hive import too. E.g. perceivedWait should be a bigint. (bad)

You can cast the field while querying, but sure, I can try to import it again as an integer.

And in general, what's the best place to discuss and debug this new conversion process - should we open a new task?

This is not a process, it is just a one-time scoop of one table. EventLogging refine on Hadoop will work quite differently: the format of the tables will be struct-based and partitioned (so Hive queries will need to be adjusted accordingly).

(One would also want to have the useragent field's JSON content accessible as e.g. a map like in the pageview_hourly table, not as a flat string.)

This is a copy of what is in MySQL, without any processing. As I mentioned earlier, just a one-off.

Do you mean that we can record the data directly to Hive, without going through the MySQL (master) table? How would one achieve this when relaunching the test?

Yes. Let's please discuss upcoming tests on a new ticket.

Nuria added a comment. Oct 6 2017, 5:48 PM

All event fields are strings now, even those that are numerical in MySQL and were in the previous Hive import too. E.g. perceivedWait should be a bigint. (bad)
Could we fix the latter, to help make sure all MySQL queries can be ported smoothly to Hive?

Note that queries will need to be altered regardless: a binary/varbinary type in Hive and in MySQL is a different thing, so things do not translate 1 to 1.

Nuria added a comment. Oct 6 2017, 9:24 PM

You can cast field while querying, but sure I can try to import again as integer.

Sorry, my mistake: bigint types are not supported by Sqoop (the documentation makes you think they might be, but that does not seem to be the case), so the best choice is to have them as strings in the tables; you can cast them to another type when selecting. Will that work?
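
(For instance, something like this, using the column names from the DESCRIBE output above:)

SELECT event_action,
       CAST(event_perceivedWait AS BIGINT)        AS perceived_wait,
       CAST(event_totalInteractionTime AS BIGINT) AS total_interaction_time
FROM nuria.popups_16364296
LIMIT 10;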

Nuria added a comment. Oct 10 2017, 6:06 PM

@Tbayer: in the light of your new tests, do we want to keep this old data or not?

All event fields are strings now, even those that are numerical in MySQL and were in the previous Hive import too. E.g. perceivedWait should be a bigint. (bad)
Could we fix the latter, to help make sure all MySQL queries can be ported smoothly to Hive?

Note that queries will need to be altered regardless a binary/varbinary type in hive and mysql is a different thing so things do not translate 1 to 1.

I think we'll be fine with the varbinary fields from MySQL being strings in Hive now. (The maximum lengths are different, apparently. But that should not matter in practice for this schema.)

You can cast field while querying, but sure I can try to import again as integer.

Sorry, my mistake bigint types are not supported by sqoop (documentation makes you think that they might be but does not seem to be the case) so best choice is to have them as strings on tables but you can cast them to another type when selecting, will that work?

Hm, it would be really great to avoid having to do that every time... It seems that preserving bigint and boolean types worked for the first import?

DESCRIBE nuria.popups;

col_name	data_type	comment
event	struct<action:string,hovercardsSuppressedByGadget:boolean,isAnon:boolean,linkInteractionToken:string,namespaceIdHover:bigint,namespaceIdSource:bigint,pageIdSource:bigint,pageTitleHover:string,pageTitleSource:string,pageToken:string,perceivedWait:bigint,popupEnabled:boolean,previewCountBucket:string,previewType:string,sessionToken:string,totalInteractionTime:bigint,editCountBucket:string,api:string,popupDelay:bigint,version:string>	
...
Nuria added a comment. Oct 10 2017, 7:01 PM

Hm, it would be really great to avoid having to do that every time... It seems that preserving bigint and boolean types worked for the first import?

That was not an import @Tbayer, that was data refined on Hadoop. As I mentioned before, since you wanted all the data for this schema and some of it was no longer in Hadoop, I imported ALL the data available in MySQL, some of it from early 2017.

@Tbayer: in the light of your new tests, do we want to keep this old data or not?

Yes, for example because the ..._16364296 table also contains the data for the previous tests in other languages.

Nuria added a comment. Oct 10 2017, 7:26 PM

Ok, let us know then if we can delete the data from MySQL, given that the data has been imported onto Hadoop.

Nuria added a comment. Oct 16 2017, 6:25 PM

Ping @Tbayer, does my explanation make sense? The struct-shaped columns will come with the automagically refined EventLogging data; scooping from MySQL leaves the data with a column format more similar to MySQL.

I understand that the types are not alike, but I think this is the best scooping we can do for now. Can we keep the data on Hadoop (we can move it to the archive db) and delete the MySQL copies of these two tables?

Tbayer added a comment. Edited Oct 16 2017, 7:09 PM

Ping @Tbayer, does my explanation make sense? The struct-shaped columns will come with the automagically refined EventLogging data; scooping from MySQL leaves the data with a column format more similar to MySQL.

Yes, I think I understand the difference. (I would be fine with calling the process of making the data available in Hive "import" too, but I can see why you prefer to call it "refining". BTW, for cross-reference, I understand that T177783 refers to this process.)

I understand that the types are not alike, but I think this is the best scooping we can do for now. Can we keep the data on Hadoop (we can move it to the archive db) and delete the MySQL copies of these two tables?

I guess yes, but let me please do one more check (considering that this was a somewhat unexpected complication, I'm not totally certain we have already evaluated all possible ramifications of converting these fields to strings and back, plus we presumably won't have the ability to restore the MySQL tables once they are dropped). For the record, I already reproduced one MySQL query [1] successfully last week on the scooped Hive table [2], verifying that the results are identical.

For now, I have mentioned this limitation in the documentation - please check that I got it right.

[1] T158172#3625184

[2]

SELECT wiki,
cards_1500viewed_on_page,
COUNT(*) AS frequency
FROM (
  SELECT wiki,
  event_pageToken,
  COUNT(DISTINCT event_linkInteractionToken) AS cards_1500viewed_on_page
  FROM nuria.Popups_16364296
  WHERE
  (event_action = 'pageLoaded'
  OR (CAST(event_totalInteractionTime AS BIGINT) >= CAST(event_perceivedWait AS BIGINT) + 1500 ))
  AND wiki IN ('huwiki', 'itwiki', 'ruwiki', 'cawiki', 'elwiki', 'hewiki')
  AND event_isAnon = 1
  AND event_popupEnabled = 1
  AND SUBSTR(timestamp,0,8) >= '20170725'
  AND SUBSTR(timestamp,0,8) < '20170822'
  GROUP BY wiki, event_popupEnabled, event_pageToken) AS pageviewlist
GROUP BY wiki, cards_1500viewed_on_page
ORDER BY wiki, cards_1500viewed_on_page LIMIT 10000;
Nuria added a comment. Oct 17 2017, 8:02 PM

Docs look good, thanks.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Nov 7 2017, 4:07 PM
elukey closed this task as Resolved. Nov 23 2017, 11:06 AM