Page MenuHomePhabricator

Identical EventLogging queries give different results on db1047 and dbstore1002
Closed, ResolvedPublic

Description

For several different EventLogging tables, the same query yields quite different results depending on whether it is run on db1047 (= "s1-analytics-slave") or on dbstore1002 ("analytics-store" = "s2-analytics-slave" etc.).

Example 1 (about MobileWebSectionUsage_15038458, queries done within a few hours on Saturday/Sunday March 26/27):
see P2832

Example 2: (both queries ran just now, and earlier on March 16 with the same results)

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE LEFT(timestamp, 8) = '20151203';
+--------+
| events |
+--------+
| 307440 |
+--------+
1 row in set (5.84 sec)

mysql:research@s1-analytics-slave.eqiad.wmnet [(none)]> SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE LEFT(timestamp, 8) = '20151203';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (2 min 15.64 sec)

Example 3: Many tables seem to differ considerably in size between store and s1, e.g. these (queried just now):

mysql:research@s1-analytics-slave.eqiad.wmnet [(none)]>  SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWikiAppToC%';
+--------------------------------------+------------------+
| table_name                           | TOTAL SIZE (GB)  |
+--------------------------------------+------------------+
| MobileWikiAppToCInteraction_10375484 | 150.709785191342 |
| MobileWikiAppToCInteraction_11014396 |   0.988343123347 |
| MobileWikiAppToCInteraction_14585319 |   0.452323913574 |
| MobileWikiAppToCInteraction_8461467  |  81.039179009385 |
+--------------------------------------+------------------+
4 rows in set (0.04 sec)

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWikiAppToC%';
+--------------------------------------+------------------+
| table_name                           | TOTAL SIZE (GB)  |
+--------------------------------------+------------------+
| MobileWikiAppToCInteraction_10375484 | 133.469544879161 |
| MobileWikiAppToCInteraction_11014396 |   0.876318826340 |
| MobileWikiAppToCInteraction_14585319 |   0.465894431807 |
| MobileWikiAppToCInteraction_8461467  |  81.233877775259 |
+--------------------------------------+------------------+
4 rows in set (0.37 sec)

I already discussed such issues briefly with @madhuvishy and @Nuria on #wikimedia-analytics two weeks ago; back then I was informed that (despite the "slave" in the name) s1 might be more reliable than on analytics-store.

(BTW, is the information here still current? https://wikitech.wikimedia.org/wiki/MariaDB#Analytics_Slaves )

Event Timeline

FYI that @jcrespo send an e-mail on this regard yesterday to @analytics and db1047 is not fitted for usage as of now.

FYI that @jcrespo send an e-mail on this regard yesterday to @analytics

This is a different issue. I had actually already seen that email (as you could have inferred from the fact that it was a reply to a question I had posted in the same thread shortly before).

and db1047 is not fitted for usage as of now.

Actually it is, according to Jaime. I had asked him about this on IRC (see partial log below) before filing this bug on his request.

[08:46:36] <HaeB>	 jynus: followup question to https://lists.wikimedia.org/pipermail/analytics/2016-March/005072.html : is it OK to restart queries on s1-analytics-slave now? (it seems to be up and running)
[...]
[08:56:51] <jynus>	 HaeB, yes, queries can go there
[08:57:11] <HaeB>	 jynus: how long might the refilling take?
[08:57:48] <jynus>	 HaeB, the tables are constantly being refilled from the master, the only thing that changes is how late they are from the master
[08:58:24] <HaeB>	 so it's just about the update lag, basically?
[08:58:25] <HaeB>	 i see
[...]
[08:59:38] <jynus>	 most seem to be 5 days behind
[08:59:48] <jynus>	 enwiki and dewiki only minutes
[09:00:17] <HaeB>	 (i'm querying older eventlogging data, so that's not a concern in my case)
[...]
[09:05:08] <jynus>	 right now dbstore1002 is probably more accurante than db1047
[...]
[09:07:37] <jynus>	 except for MobileAppUploadAttempts_5257716
[09:08:37] <HaeB>	 jynus: also for older data? (from around the beginning of january, say)
[09:09:16] <jynus>	 yes, in fact I can say that (except for that table) all rows from January to now are there
[09:09:19] <HaeB>	 jynus: and why? (the way madhu and nuria explained it to me made it sound that it was the other way around... that s1-analytics might be more reliable than store)
[09:09:34] <jynus>	 well, eventually, both will be
[...]
[09:10:23] <jynus>	 but db1047 just had issues and I have not yet completed a full resync
[...]
[09:11:10] <HaeB>	 jynus: ok, but that would not affect (say) eventlogging data from january 7, say?
[09:11:50] <HaeB>	 (the missing resync, i mean)
[09:11:58] <jynus>	 I am actually checking since 2016-01-01
[09:13:15] <jynus>	 I think the best way to check it is for you to run a quick count(*) from the timestamp that interests you an compare results
[09:13:33] <jynus>	 if you see a difference, tell me

@jcrespo Any progress here? (I just reran Example 2 and it still gives the same results with the same discrepancy.)

I will give it a look tomorrow.

jcrespo triaged this task as Medium priority.May 25 2016, 5:47 PM
jcrespo moved this task from Triage to Pending comment on the DBA board.

See also T136057#2328652 (@Dbrant encountering a "wrong checksum" error on s1-analytics-slave while the same query works on s2)

^ I had the exact same problem yesterday. Thanks for documenting this, @Tbayer!

From what I can see, results are >3 months. By that time, if results arrive, results may have been already purged due to retention policies, resulting on partial results. Why results so old keep being inserted is something I do not know- but as far as I can see, it happens.

jcrespo moved this task from Pending comment to In progress on the DBA board.

BTW, do never do:

MariaDB EVENTLOGGING m4 localhost log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE LEFT(timestamp, 8) = '20151203';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (4.69 sec)

I literally have that as an example on may talks on what not to do: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/44 . Please give a look at those slides. Compare the speed with:

MariaDB EVENTLOGGING m4 localhost log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (0.46 sec)

This is a query done on the master. It seems analytics-store is missing those particular rows. Blame T125135 for some rows skipped (someone decided to get rid of auto-increments, and now new rows are inserted in parallel, so for a time not all rows were inserted).

I am resyncing everything from December now, that will fix this particular issue.

Thank you, the code on our end is reayd to add an auto increment column. Please see: https://phabricator.wikimedia.org/T125135

MariaDB  db1047 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (1.40 sec)

MariaDB  db1046 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (0.52 sec)

MariaDB  dbstore1002 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (0.61 sec)

This has been fixed for all tables up to December.